Michael Smith, Alex Frye, Chris Boomhower ----- 4/05/2017

Describe the purpose of the model you are about to build
The data set again selected by our group for Lab 3 consists of Citi Bike trip history data collected and released by NYC Bike Share, LLC and Jersey Bike Share, LLC under Citi Bike's NYCBS Data Use Policy. Citi Bike is America's largest bike share program, with 10,000 bikes and 600 stations across Manhattan, Brooklyn, Queens, and Jersey City... 55 neighborhoods in all. As such, our data set's trip history includes all rental transactions conducted within the NYC Citi Bike system from July 1st, 2013 to February 28th, 2014. These transactions amount to 5,562,293 trips within this time frame. The original data set includes 15 attributes. In addition to these 15, our team derived 10 more attributes for use in our classification efforts, several of which are NYC weather measurements obtained from the Carbon Dioxide Information Analysis Center (CDIAC). These weather data are merged with the Citi Bike data to provide environmental insights into rider behavior.
The trip data was collected via Citi Bike's check-in/check-out system among 330 of its stations in the NYC system as part of its transaction history log. While the non-publicized data likely includes further particulars such as rider payment details, the publicized data is anonymized to protect rider identity while simultaneously offering bike share transportation insights to urban developers, engineers, academics, statisticians, and other interested parties. The CDIAC data, however, was collected by the Department of Energy's Oak Ridge National Laboratory for research into global climate change. While basic weather conditions are recorded by CDIAC, as included in our fully merged data set, the organization also measures atmospheric carbon dioxide and other radiatively active gas levels to conduct their research efforts.
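Conceptually, the weather merge keys each trip on its start date, so every trip row picks up that day's PRCP, SNOW, and temperature readings. The toy frames below are illustrative stand-ins for the real files (the actual notebook performs this merge with plain Python loops over the raw CSV lines, shown later); with pandas the join could be sketched as:

```python
import pandas as pd

# Toy stand-ins for the real trip and CDIAC weather tables
trips = pd.DataFrame({
    'starttime': pd.to_datetime(['2013-07-01 08:15', '2013-07-02 17:40']),
    'tripduration': [620, 900],
})
weather = pd.DataFrame({
    'date': pd.to_datetime(['2013-07-01', '2013-07-02']),
    'PRCP': [0.0, 0.3],
    'TAVE': [78, 74],
})

# Key each trip on its start *date* (timestamp floored to midnight),
# then left-join that day's weather readings onto the trip
trips['date'] = trips['starttime'].dt.normalize()
merged = trips.merge(weather, on='date', how='left')
```

A left join keeps every trip even if a weather day were missing, which mirrors how the notebook treats the trip log as the primary table.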
Our team has taken particular interest in this data set as some of our team members enjoy both recreational and commute cycling. By combining basic weather data with Citi Bike's trip data, our intent in this lab is to: 1) fit clusters describing both Customer and Subscriber user types; 2) build a classification model for both Customer and Subscriber clusters; 3) predict Customer clusters for Subscriber observations and vice versa, providing a classification-ready data set for predicting whether riders are more likely to be (or become) Citi Bike subscribers based on environmental conditions, the day of the week of the trip, trip start and end locations, the general time of day (i.e. morning, midday, afternoon, evening, night), rider age and gender, Customer/Subscriber cluster features, etc.; and 4) demonstrate how we would deploy these clusters against a new data entry for real-time use in a classification model. Due to the exhaustive number of observations in the original data set (5,562,293), a sample of 500,000 is selected to achieve these goals (as described further in the sections below).
Describe the meaning and type of data
Before diving into each attribute in detail, one glaring facet of this data set that needs mentioning is its inherent time-series nature. By no means was this overlooked when we decided upon these particular data. To mitigate the effects of time on our analysis results, we have chosen to aggregate time-centric attributes such as dates and hours of the day by replacing them with simply the day of the week or period of the day (more on these details shortly). For example, by identifying trips occurring on July 1st, 2013, not by the date of occurrence but rather the day of the week, Monday, and identifying trips on July 2nd, 2013, as occurring on Tuesday, we will be able to obtain a "big picture" understanding of trends by day of the week instead of at the date-by-date level. We understand this is not a perfect solution since the time-series component is still an underlying factor in trip activity, but it is good enough to answer the types of questions we hope to target as described in the previous section as we will be comparing all Mondays against all Tuesdays, etc.
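The replacement itself is mechanical with the standard library: `strftime('%A')` yields the weekday name for any concrete date. A minimal sketch of the aggregation described above:

```python
from datetime import datetime

# Replace concrete dates with their day-of-week labels
for d in ('2013-07-01', '2013-07-02'):
    day = datetime.strptime(d, '%Y-%m-%d')
    print(d, '->', day.strftime('%A'))  # 2013-07-01 -> Monday, 2013-07-02 -> Tuesday
```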
As mentioned previously, the original data set from Citi Bike included 15 attributes. These 15 attributes and associated descriptions are provided below:
It is important to note that birth year and gender details are not available for "Customer" user types but rather for "Subscriber" riders only. Fortunately, these are the only missing data values among all trips in the data set. Unfortunately, however, it means that we will not be able to identify the male-to-female ratio among non-subscribers or use age to predict Subscribers vs. non-subscribers (Customers). More to this end will be discussed in the next section.
It is also worth mentioning that while attributes such as trip duration, start and end stations, bike ID, and basic rider details were collected and shared with the general public, care was taken by Citi Bike to remove trips taken by staff during system service appointments and inspections, trips to or from "test" stations which were employed during the data set's timeframe, and trips lasting less than 60 seconds, which could indicate false checkouts or re-docking efforts during check-in.
Because some attributes may be deemed duplicates (i.e. start_station_id, start_station_name, and start_station_latitude/longitude all identify station locations), we chose to extract further attributes from the base attributes at hand. Further attributes were also extracted to mitigate the effects of time. In addition, we felt increased understanding could be gained by merging weather data into the various trips, as discussed in the previous section. These additional 10 attributes are described below:
After extracting our own attributes and merging weather data, the total number of attributes present in our final data set is 25. Only 15 are used throughout this lab, however, due to the duplicate nature of some attributes as discussed already. This final list of used attributes are tripduration, DayOfWeek, TimeOfDay, HolidayFlag, start_station_name, start_station_latitude, start_station_longitude, usertype, gender, Age, PRCP, SNOW, TAVE, TMAX, and TMIN.
To begin our analysis, we need to load the data from our source .csv files. Steps taken to pull data from the various source files are as follows:
Below you will see this process, as well as import/options for needed python modules throughout this analysis.
import os
from geopy.distance import vincenty
import holidays
from datetime import datetime
from dateutil.parser import parse
import glob
import pandas as pd
import numpy as np
from IPython.display import display
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import gmaps
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import log_loss
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from itertools import cycle
from sklearn.metrics import roc_curve, auc
from scipy import interp
from sklearn.metrics import confusion_matrix
import itertools
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
import statsmodels.stats.api as sms
from sklearn.cluster import DBSCAN
from sklearn.cluster import SpectralClustering
from sklearn.neighbors import kneighbors_graph
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.cm as cmx
%load_ext memory_profiler
plt.rcParams['figure.figsize'] = (12, 6)
pd.options.mode.chained_assignment = None
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import pickle
def pickleObject(objectname, filename, filepath = "PickleFiles/"):
    fullpicklepath = "{0}{1}.pkl".format(filepath, filename)
    # Open the pickle file in write mode and serialize the object to it
    picklefile = open(fullpicklepath, 'wb')
    pickle.dump(objectname, picklefile)
    picklefile.close()

def unpickleObject(filename, filepath = "PickleFiles/"):
    fullunpicklepath = "{0}{1}.pkl".format(filepath, filename)
    # Open the pickle file in read mode and deserialize the object from it
    unpicklefile = open(fullunpicklepath, 'rb')
    unpickledobject = pickle.load(unpicklefile)
    unpicklefile.close()
    return unpickledobject
def clear_display():
    from IPython.display import clear_output
    clear_output()  # clear the current cell's output
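As a sanity check, the two pickle helpers should round-trip any object through disk. The self-contained sketch below rewrites them with context managers and exercises them against a temporary directory (so nothing lands in PickleFiles/); the example object is arbitrary:

```python
import os
import pickle
import tempfile

def pickleObject(objectname, filename, filepath="PickleFiles/"):
    # Serialize any Python object to <filepath><filename>.pkl
    with open("{0}{1}.pkl".format(filepath, filename), 'wb') as picklefile:
        pickle.dump(objectname, picklefile)

def unpickleObject(filename, filepath="PickleFiles/"):
    # Load a previously pickled object back into memory
    with open("{0}{1}.pkl".format(filepath, filename), 'rb') as unpicklefile:
        return pickle.load(unpicklefile)

tmpdir = tempfile.mkdtemp() + os.sep
pickleObject({'stations': 330}, 'demo', filepath=tmpdir)
restored = unpickleObject('demo', filepath=tmpdir)
```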
############################################################
# Load & Merge Data from Source Files
# Parse into Compiled Files
############################################################
starttime = datetime.now()
print('Starting Source Data Load & Merge Process. \n'
      'Start Time: ' + str(starttime))

if os.path.isfile("Compiled Data/dataset1.csv"):
    print("Found the File!")
else:
    citiBikeDataDirectory = "Citi Bike Data"
    citiBikeDataFileNames = [
        "2013-07 - Citi Bike trip data - 1.csv",
        "2013-07 - Citi Bike trip data - 2.csv",
        "2013-08 - Citi Bike trip data - 1.csv",
        "2013-08 - Citi Bike trip data - 2.csv",
        "2013-09 - Citi Bike trip data - 1.csv",
        "2013-09 - Citi Bike trip data - 2.csv",
        "2013-10 - Citi Bike trip data - 1.csv",
        "2013-10 - Citi Bike trip data - 2.csv",
        "2013-11 - Citi Bike trip data - 1.csv",
        "2013-11 - Citi Bike trip data - 2.csv",
        "2013-12 - Citi Bike trip data.csv",
        "2014-01 - Citi Bike trip data.csv",
        "2014-02 - Citi Bike trip data.csv"
    ]
    weatherDataFile = "Weather Data/NY305801_9255_edited.txt"
    citiBikeDataRaw = []
    for file in citiBikeDataFileNames:
        print(file)
        filepath = citiBikeDataDirectory + "/" + file
        with open(filepath) as f:
            lines = f.read().splitlines()
        lines.pop(0)  # get rid of the first line that contains the column names
        for line in lines:
            line = line.replace('"', '')
            line = line.split(",")
            # Straight-line distance between start and end stations
            sLatLong = (line[5], line[6])
            eLatLong = (line[9], line[10])
            distance = vincenty(sLatLong, eLatLong).miles
            line.extend([distance])
            # weekday() mapping: Monday = 0, Tuesday = 1, ..., Sunday = 6
            if parse(line[1]).weekday() == 0:
                DayOfWeek = "Monday"
            elif parse(line[1]).weekday() == 1:
                DayOfWeek = "Tuesday"
            elif parse(line[1]).weekday() == 2:
                DayOfWeek = "Wednesday"
            elif parse(line[1]).weekday() == 3:
                DayOfWeek = "Thursday"
            elif parse(line[1]).weekday() == 4:
                DayOfWeek = "Friday"
            elif parse(line[1]).weekday() == 5:
                DayOfWeek = "Saturday"
            else:
                DayOfWeek = "Sunday"
            line.extend([DayOfWeek])
            # Morning 5AM-10AM, Midday 10AM-2PM, Afternoon 2PM-5PM,
            # Evening 5PM-10PM, Night 10PM-5AM
            if parse(line[1]).hour >= 5 and parse(line[1]).hour < 10:
                TimeOfDay = 'Morning'
            elif parse(line[1]).hour >= 10 and parse(line[1]).hour < 14:
                TimeOfDay = 'Midday'
            elif parse(line[1]).hour >= 14 and parse(line[1]).hour < 17:
                TimeOfDay = 'Afternoon'
            elif parse(line[1]).hour >= 17 and parse(line[1]).hour < 22:
                TimeOfDay = 'Evening'
            else:
                TimeOfDay = 'Night'
            line.extend([TimeOfDay])
            # HolidayFlag: 1 = Yes, 0 = No
            if parse(line[1]) in holidays.UnitedStates():
                holidayFlag = "1"
            else:
                holidayFlag = "0"
            line.extend([holidayFlag])
            citiBikeDataRaw.append(line)
        del lines
    with open(weatherDataFile) as f:
        weatherDataRaw = f.read().splitlines()
    weatherDataRaw.pop(0)  # again, get rid of the column names
    for c in range(len(weatherDataRaw)):
        weatherDataRaw[c] = weatherDataRaw[c].split(",")
        # Zero-pad days and months so the date comparison below matches
        if len(weatherDataRaw[c][2]) < 2:
            weatherDataRaw[c][2] = "0" + weatherDataRaw[c][2]
        if len(weatherDataRaw[c][0]) < 2:
            weatherDataRaw[c][0] = "0" + weatherDataRaw[c][0]
    citiBikeData = []
    while citiBikeDataRaw:
        instance = citiBikeDataRaw.pop()
        date = instance[1].split(" ")[0].split("-")  # uses the start date of the trip
        for record in weatherDataRaw:
            if (str(date[0]) == str(record[4]) and str(date[1]) == str(record[2])
                    and str(date[2]) == str(record[0])):
                instance.extend([record[5], record[6], record[7], record[8], record[9]])
        citiBikeData.append(instance)
    del citiBikeDataRaw
    del weatherDataRaw
    # Final Columns:
    # 0 tripduration
    # 1 starttime
    # 2 stoptime
    # 3 start station id
    # 4 start station name
    # 5 start station latitude
    # 6 start station longitude
    # 7 end station id
    # 8 end station name
    # 9 end station latitude
    # 10 end station longitude
    # 11 bikeid
    # 12 usertype
    # 13 birth year
    # 14 gender
    # 15 start/end station distance
    # 16 DayOfWeek
    # 17 TimeOfDay
    # 18 HolidayFlag
    # 19 PRCP
    # 20 SNOW
    # 21 TAVE
    # 22 TMAX
    # 23 TMIN
    # Write the merged rows out in chunks of maxLineCount lines per file
    maxLineCount = 250000
    lineCounter = 1
    fileCounter = 1
    outputDirectoryFilename = "Compiled Data/dataset"
    f = open(outputDirectoryFilename + str(fileCounter) + ".csv", "w")
    for line in citiBikeData:
        if lineCounter == maxLineCount:
            print(f)
            f.close()
            lineCounter = 1
            fileCounter = fileCounter + 1
            f = open(outputDirectoryFilename + str(fileCounter) + ".csv", "w")
        f.write(",".join(map(str, line)) + "\n")
        lineCounter = lineCounter + 1
    f.close()
    del citiBikeData

endtime = datetime.now()
print('Ending Source Data Load & Merge Process. \n'
      'End Time: ' + str(endtime) + '\n'
      'Total RunTime: ' + str(endtime - starttime))
Now that we have compiled data files from both Citi Bike and the weather data, we want to load them into a pandas DataFrame for analysis. We iterate over and load each file produced above, then assign each column its appropriate data type. Additionally, we compute the Age column after substituting a default value for missing "Birth Year" values. This is discussed further in the Data Preparation 1 section.
%%time
############################################################
# Load the Compiled Data from CSV
############################################################
# Create CSV Reader Function and assign column headers
def reader(f, columns):
    d = pd.read_csv(f)
    d.columns = columns
    return d
# Identify All CSV FileNames needing to be loaded
path = r'Compiled Data'
all_files = glob.glob(os.path.join(path, "*.csv"))
# Define File Columns
columns = ["tripduration", "starttime", "stoptime", "start_station_id", "start_station_name",
"start_station_latitude",
"start_station_longitude", "end_station_id", "end_station_name", "end_station_latitude",
"end_station_longitude", "bikeid", "usertype", "birth year", "gender", "LinearDistance", "DayOfWeek",
"TimeOfDay", "HolidayFlag", "PRCP", "SNOW", "TAVE", "TMAX", "TMIN"]
# Load Data
CitiBikeDataCompiled = pd.concat([reader(f, columns) for f in all_files])
# Replace '\N' Birth Years with Zero Values
CitiBikeDataCompiled["birth year"] = CitiBikeDataCompiled["birth year"].replace(r'\N', '0')
# Convert Columns to Numerical Values
CitiBikeDataCompiled[['tripduration', 'birth year', 'LinearDistance', 'PRCP', 'SNOW', 'TAVE', 'TMAX', 'TMIN']] \
= CitiBikeDataCompiled[['tripduration', 'birth year', 'LinearDistance', 'PRCP', 'SNOW', 'TAVE', 'TMAX',
'TMIN']].apply(pd.to_numeric)
# Convert Columns to Date Values
CitiBikeDataCompiled[['starttime', 'stoptime']] \
= CitiBikeDataCompiled[['starttime', 'stoptime']].apply(pd.to_datetime)
# Compute Age: 0 Birth Year = 0 Age ELSE Compute Start Time Year Minus Birth Year
CitiBikeDataCompiled["Age"] = np.where(CitiBikeDataCompiled["birth year"] == 0, 0,
CitiBikeDataCompiled["starttime"].dt.year - CitiBikeDataCompiled[
"birth year"])
# Convert Columns to Str Values
CitiBikeDataCompiled[['start_station_id', 'end_station_id', 'bikeid', 'HolidayFlag', 'gender']] \
= CitiBikeDataCompiled[['start_station_id', 'end_station_id', 'bikeid', 'HolidayFlag', 'gender']].astype(str)
%%time
print(len(CitiBikeDataCompiled))
display(CitiBikeDataCompiled.head())
When analyzing our final dataset for accuracy, there are a few key factors we can readily verify and research:
Computational Accuracy: Ensure data attributes added by computation are correct
Missing Data from Source
Although we are able to research these many factors, one computation may still lack information in this dataset. Our LinearDistance attribute computes the distance from one lat/long coordinate to another. This attribute does not, however, tell us the 'true' distance a biker traveled before returning the bike. Some bikers may ride for exercise around the city with various turns and loops, whereas others travel the quickest path to their destination. Because our dataset limits us to start and end locations, we do not have enough information to accurately compute distance traveled. For this reason, we have named the attribute "LinearDistance" rather than "DistanceTraveled".
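For readers without geopy installed, the straight-line figure behind LinearDistance can be approximated with the standard-library haversine formula; geopy's vincenty is more precise on the WGS-84 ellipsoid, but at station-to-station scales the two agree closely. A sketch, using two illustrative Midtown coordinates rather than points from the data set:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in miles
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(a))  # 3958.8 = Earth's mean radius in miles

# Roughly half a mile between these two illustrative Midtown points
d = haversine_miles(40.7527, -73.9772, 40.7484, -73.9857)
```

Whatever the formula, the result remains a straight-line lower bound on the distance actually ridden.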
Below we will walk through the process of researching the 'Measurable' data quality factors mentioned above.
To help mitigate challenges with time series data, we have chosen to break TimeOfDay into 5 categories. These categories are broken down below:
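In code form, the binning rule reduces to a small pure function using the same hour thresholds applied during the merge step (shown here only to make the cutoffs explicit):

```python
def time_of_day(hour):
    # Same cutoffs as the merge step: Morning 5-10, Midday 10-14,
    # Afternoon 14-17, Evening 17-22, Night 22-5
    if 5 <= hour < 10:
        return 'Morning'
    elif 10 <= hour < 14:
        return 'Midday'
    elif 14 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 22:
        return 'Evening'
    return 'Night'
```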
To ensure that these breakdowns are accurately computed, we pulled the distinct list of TimeOfDay assignments by starttime hour. Looking at the results below, we can verify that this categorization is correctly being assigned.
%%time
# Compute StartHour from StartTime
CitiBikeDataCompiled["StartHour"] = CitiBikeDataCompiled["starttime"].dt.hour
# Compute Distinct Combinations of StartHour and TimeOfDay
DistinctTimeOfDayByHour = CitiBikeDataCompiled[["StartHour", "TimeOfDay"]].drop_duplicates().sort_values("StartHour")
# Print
display(DistinctTimeOfDayByHour)
#Clean up Variables
del CitiBikeDataCompiled["StartHour"]
In order to verify our computed DayOfWeek column, we have chosen one full week from 12/22/2013 - 12/28/2013 to validate. Below is a calendar image of this week to baseline our expected results:

To verify these 7 days, we pulled the distinct list of DayOfWeek assignments by start date (no time). If we can verify one full week, we can reasonably conclude that the computation is correct across the entire dataset. Looking at the results below, we can verify that this categorization is being assigned correctly.
%%time
# Create DataFrame for StartTime, DayOfWeek within Date Threshold
CitiBikeDayOfWeekTest = CitiBikeDataCompiled[(CitiBikeDataCompiled['starttime'].dt.year == 2013)
& (CitiBikeDataCompiled['starttime'].dt.month == 12)
& (CitiBikeDataCompiled['starttime'].dt.day >= 22)
& (CitiBikeDataCompiled['starttime'].dt.day <= 28)][
["starttime", "DayOfWeek"]]
# Create FloorDate Variable as StartTime without the timestamp
CitiBikeDayOfWeekTest["StartFloorDate"] = CitiBikeDayOfWeekTest["starttime"].dt.strftime('%m/%d/%Y')
# Compute Distinct combinations
DistinctDayOfWeek = CitiBikeDayOfWeekTest[["StartFloorDate", "DayOfWeek"]].drop_duplicates().sort_values(
"StartFloorDate")
#Print
display(DistinctDayOfWeek)
# Clean up Variables
del CitiBikeDayOfWeekTest
del DistinctDayOfWeek
Using the same week as was used to verify DayOfWeek, we can test whether HolidayFlag is set correctly for the Christmas holiday. We pulled the distinct list of HolidayFlag assignments by start date (no time). If we can verify one holiday, we can reasonably conclude that the computation is correct across the entire dataset. Looking at the results below, we expect to see HolidayFlag = 1 only for 12/25/2013.
%%time
# Create DataFrame for StartTime, HolidayFlag within Date Threshold
CitiBikeHolidayFlagTest = CitiBikeDataCompiled[(CitiBikeDataCompiled['starttime'].dt.year == 2013)
& (CitiBikeDataCompiled['starttime'].dt.month == 12)
& (CitiBikeDataCompiled['starttime'].dt.day >= 22)
& (CitiBikeDataCompiled['starttime'].dt.day <= 28)][
["starttime", "HolidayFlag"]]
# Create FloorDate Variable as StartTime without the timestamp
CitiBikeHolidayFlagTest["StartFloorDate"] = CitiBikeHolidayFlagTest["starttime"].dt.strftime('%m/%d/%Y')
# Compute Distinct combinations
DistinctHolidayFlag = CitiBikeHolidayFlagTest[["StartFloorDate", "HolidayFlag"]].drop_duplicates().sort_values(
"StartFloorDate")
#Print
display(DistinctHolidayFlag)
# Clean up Variables
del CitiBikeHolidayFlagTest
del DistinctHolidayFlag
Accounting for missing data is a crucial part of our analysis. At first glance, it is apparent that we have a large amount of missing data in the Gender and Birth Year attributes from our source Citi Bike data. We already handled missing Birth Year values while computing "Age" in the Data Load from CSV section of this paper, assigning a default value of 0 so that downstream computations do not produce NA values. Gender missing values were already coded with a default value of 0 in the source data. Although we have handled these missing values with defaults, we want to determine whether we 'need' these records for further analysis, or whether we may remove them from the dataset. Below you will see a table showing the frequency of missing (or forced default) values by usertype. We noticed that of the 4,881,384 Subscriber members in our dataset, only 295 were missing Gender information, whereas out of the 680,909 Customer users (non-subscribing), there was only one observation with complete information for both Gender and Birth Year. This quickly told us that removing records with missing values is NOT an option, since we would lose essentially the entire Customer usertype. These attributes, as well as Age (computed from birth year), will therefore be difficult to use in a classification model attempting to predict usertype.
We have also looked at all other attributes and verified that there are no additional missing values in our dataset. A missing-value matrix was produced to identify any gaps in our data across all attributes. Because the results were conclusive, with no missing values present, we removed this uninformative visualization from the report.
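The check above reduces to counting nulls or default-coded entries per attribute. A minimal version of that gap scan, run here on a toy frame standing in for the full data set (values are illustrative; '0' codes for missing, as in the source):

```python
import pandas as pd

toy = pd.DataFrame({
    'usertype': ['Subscriber', 'Customer', 'Subscriber'],
    'gender': ['1', '0', '2'],       # '0' is the source's missing-value default
    'birth year': [1985, 0, 1990],   # 0 is our imputed default
})

# Count default-coded "missing" entries per attribute
missing = {
    'gender': int((toy['gender'] == '0').sum()),
    'birth year': int((toy['birth year'] == 0).sum()),
}
```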
%%time
NADatatestData = CitiBikeDataCompiled[["usertype","gender", "birth year"]]
NADatatestData["GenderISNA"] = np.where(CitiBikeDataCompiled["gender"] == '0', 1, 0)
NADatatestData["BirthYearISNA"] = np.where(CitiBikeDataCompiled["birth year"] == 0, 1,0)
NAAggs = pd.DataFrame({'count' : NADatatestData.groupby(["usertype","GenderISNA", "BirthYearISNA"]).size()}).reset_index()
display(NAAggs)
del NAAggs
To ensure that there are no duplicate records in our dataset, we verified that the number of records before and after removing potential duplicates was the same. This test passed, so no alterations to the dataset were needed on account of duplicate records.
%%time
len(CitiBikeDataCompiled) == len(CitiBikeDataCompiled.drop_duplicates())
Trip Duration
In analyzing a box plot of trip duration values, we find extreme outliers present. With durations reaching up to 72 days in the most extreme instance, our team decided to rule out any observation with a duration greater than a 24-hour period. Beyond 24 hours, it becomes much more likely that an individual simply kept the bike checked out overnight rather than rode it continuously. Such values heavily skew the distribution, potentially distorting any analysis performed on it. We move forward with removing a total of 457 observations with trip durations greater than 24 hours (86,400 seconds).
%%time
%matplotlib inline
#CitiBikeDataCompiledBackup = CitiBikeDataCompiled
#CitiBikeDataCompiled = CitiBikeDataCompiledBackup
# BoxPlot tripDuration - Heavy Outliers!
sns.boxplot(y = "tripduration", data = CitiBikeDataCompiled)
sns.despine()
# How Many Greater than 24 hours?
print(len(CitiBikeDataCompiled[CitiBikeDataCompiled["tripduration"]>86400]))
# Remove > 24 Hours
CitiBikeDataCompiled = CitiBikeDataCompiled[CitiBikeDataCompiled["tripduration"]<86400]
Once outliers are removed, we run the boxplot again and still see skewness in the results. To mitigate this right-skewed distribution (a long upper tail of large durations), we take a log transform of the attribute.
%%time
%matplotlib inline
# BoxPlot Trip Duration AFTER removal of outliers
sns.boxplot(y = "tripduration", data = CitiBikeDataCompiled)
sns.despine()
# Log Transform Column Added
CitiBikeDataCompiled["tripdurationLog"] = CitiBikeDataCompiled["tripduration"].apply(np.log)
%%time
%matplotlib inline
# BoxPlot TripDurationLog
sns.boxplot(y = "tripdurationLog", data = CitiBikeDataCompiled)
sns.despine()
Age
Similarly, we look at the distribution of Age in our dataset. Interestingly, several outlier observations log birth years far enough back that the computed age reaches 115 years. Possible reasons for these outlier ages include data entry errors by riders reluctant to disclose personal information, or account sharing between a parent and a child, rendering the data point inaccurate for the person actually taking the trip. Our target demographic for this study is individuals under 65 years of age, given that these age groups are more likely to be in the physical condition the bike share service demands. Given this target demographic, and the poor entries causing extreme outliers, we have chosen to limit our dataset to observations up to 65 years of age. This change removed an additional 53,824 records from the dataset.
%%time
%matplotlib inline
# BoxPlot Age - Outliers!
sns.boxplot(y = "Age", data = CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]!= 0])
sns.despine()
# How Many Greater than 65 years old?
print(len(CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]>65]))
# Remove > 65 years old
CitiBikeDataCompiled = CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]<=65]
%%time
%matplotlib inline
# BoxPlot Age - removed Outliers!
sns.boxplot(y = "Age", data = CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]!= 0])
sns.despine()
Given the extremely large volume of data collected, we have decided to sample down to roughly 1/10th of the original dataset, for a total of 500,000 records. Before taking this action, however, we wanted to ensure that we keep data proportions reasonable for analysis and do not lose any important demographic in our data.
Below we compute the percentage of our dataset made up of Customers vs. Subscribers. We note that 87.6% of the data consists of Subscriber users, whereas the remaining 12.4% are Customers.
%%time
%matplotlib inline
UserTypeDist = pd.DataFrame({'count' : CitiBikeDataCompiled.groupby(["usertype"]).size()}).reset_index()
display(UserTypeDist)
UserTypeDist.plot.pie(y = 'count', labels = ['Customer', 'Subscriber'], autopct='%1.1f%%')
For our sample dataset in this analysis, we have chosen to sample Customer observations at a rate well above their natural proportion, forcing a 50/50 split between the two classifications. This will help reduce bias in the model toward Subscribers that would otherwise arise simply from the distribution of data in the sample.
We are able to compute the sample size for each usertype and then take a random sample within each group. Below you will see that our sampled distribution matches the chosen 50/50 split between Customers and Subscriber Usertypes.
%%time
SampleSize = 500000
CustomerSampleSize_Seed = int(round(SampleSize * 50.0 / 100.0,0))
SubscriberSampleSize_Seed = int(round(SampleSize * 50.0 / 100.0,0))
CitiBikeCustomerDataSampled = CitiBikeDataCompiled[CitiBikeDataCompiled["usertype"] == 'Customer'].sample(n=CustomerSampleSize_Seed, replace = False, random_state = CustomerSampleSize_Seed)
CitiBikeSubscriberDataSampled = CitiBikeDataCompiled[CitiBikeDataCompiled["usertype"] == 'Subscriber'].sample(n=SubscriberSampleSize_Seed, replace = False, random_state = SubscriberSampleSize_Seed)
CitiBikeDataSampled_5050 = pd.concat([CitiBikeCustomerDataSampled,CitiBikeSubscriberDataSampled])
print(len(CitiBikeDataSampled_5050))
UserTypeDist = pd.DataFrame({'count' : CitiBikeDataSampled_5050.groupby(["usertype"]).size()}).reset_index()
display(UserTypeDist)
UserTypeDist.plot.pie(y = 'count', labels = ['Customer', 'Subscriber'], autopct='%1.1f%%')
To re-iterate, our main objectives in analyzing these data are to determine which attributes have greatest bearing on predicting a rider's type (Customer vs. Subscriber) and to gain a better understanding of rider behavior as a function of external factors. Many attributes in this data set may eventually be used in subsequent analyses to answer these questions. The primary attributes on which we will focus our attention in this section, however, are as follows:
Over the course of this section, we will review these top attributes in some detail and discuss the value of using our chosen visualizations. Note also that merged weather data is of significant interest as well.
Before discussing the following heatmap in detail, it is worth noting some special steps required to use the gmaps module in Python, in case the reader is interested in rendering our code to plot data on top of Google's maps (full instructions are available at https://media.readthedocs.org/pdf/jupyter-gmaps/latest/jupyter-gmaps.pdf).
Besides having Jupyter Notebook installed on one's computer with extensions enabled (default if using Anaconda) and installing the gmaps module using pip, the following line should be run from within the command terminal. This is only to be done once and should be done when Jupyter Notebook is not running.
$ jupyter nbextension enable --py gmaps
In addition to running the above line in the command prompt, a Standard Google API key will need to be obtained from https://developers.google.com/maps/documentation/javascript/get-api-key. This only needs to be done once and is necessary to pull the Google map data into the Jupyter Notebook environment. The key is entered in the gmaps.configure() line as shown in the cell below. We have provided our own private key in the meantime for the reader's convenience.
Now on to the data visualization... This geographical heatmap is interactive; however, the code block must be re-run each time our Jupyter Notebook file is opened due to the API key requirement. Therefore, we've captured some interesting views to aid in our discussion and have included them as embedded images.
The start station heatmap represents the start station location data via attributes start_station_latitude and start_station_longitude. It identifies areas of highest and lowest concentration for trip starts. The location data is important as it helps us understand where the areas of highest activity are and, as will be seen in one of our later sections, will play an important role in identifying riders as regular customers or subscribers.
%%time
gmaps.configure(api_key="AIzaSyAsBi0MhgoQWfoGMSl5UcD-vR6H76cntxg") # Load private Google API key
locations = CitiBikeDataSampled_5050[['start_station_latitude', 'start_station_longitude']].values.tolist()
m = gmaps.Map()
heatmap_layer = gmaps.Heatmap(data = locations)
m.add_layer(heatmap_layer)
An overall view quickly reveals that station data was only provided for the southern portion of Manhattan and the northern edge of Brooklyn. This could mean either that the bike share program had not yet expanded into other areas at the time of data collection or that the data simply wasn't included (as mentioned previously, many test sites were in use during this time frame, but Citi Bike did not include them in this data set).
Within the range of trip start frequency from the least number of trips (green) to the most trips (red), green and yellow indicate low to medium trip activity in most areas. However, higher pockets of concentration do exist in some places. We will attempt to put this visualization to good use by focusing in on one of these hotspots.
m

A prominent hotspot occurs just east of Bryant Park and the Midtown map label. Zooming into this area (via regular Google Maps controls, as the rendered visual is interactive) allows for a closer look. A snapshot of this zoomed-in view is embedded below. The hotspot appears slightly elongated and stands out from among the other stations. Zooming in further helps explain why this is and may shed some light on the higher activity in this area.

Zooming in to this area further helps us see that two stations are very close together. Even so, why might there be such high rider activity at these stations? This higher activity is likely affected by the stations' proximity to the famous Grand Central Station. As commuters and recreationalists alike arrive by train at Grand Central, it is natural that many of them may choose to continue their journey via the two closest bike share stations nearby. When the northernmost bike share station runs out of bikes, riders likely go to the next station to begin their ride instead.
By understanding the dynamics of geographical activity within this data set and the amenities that surround each station, we will be able to more efficiently leverage the data to make our classification and regression predictions.

After visualizing the overall dataset locations with a heatmap over NYC, we decided to take the visualization one step further, breaking the dataset into two segments: Customer vs. Subscriber. Below are two separate gmaps heatmaps containing geographic densities for each usertype. What we found supported our theories on customer vs. subscriber usage tendencies. Seen first, the Customer heatmap contains far fewer dense regions overall. This helps confirm our suspicion that Customer riders are less "routine" than subscribing riders. When looking for the densest region in this heatmap, one point stuck out as particularly interesting: the zoo. The Subscriber gmap does not show the same kind of traffic in that region! This supports our theory that customer riders use the service more for events, shopping, or one-time convenience. On the Subscriber gmap, the densest region is the one near Grand Central Station discussed earlier, supporting the opposing theory that subscribing members make routine trips to work, groceries, etc., consistently using the bike share service as a means to reach the metro station.
Customer Users
%%time
customerData = CitiBikeDataSampled_5050.query('usertype == "Customer"')
customerLoc = customerData[['start_station_latitude', 'start_station_longitude']].values.tolist()
cmap = gmaps.Map()
customer_layer = gmaps.Heatmap(data=customerLoc)#, fill_color="red", stroke_color="red", scale=3)
cmap.add_layer(customer_layer)
cmap

Subscriber Users
%%time
subscriberData = CitiBikeDataSampled_5050.query('usertype == "Subscriber"')
subscriberLoc = subscriberData[['start_station_latitude', 'start_station_longitude']].values.tolist()
smap = gmaps.Map()
subscriber_layer = gmaps.Heatmap(data=subscriberLoc)#, fill_color="green", stroke_color="green", scale=2)
smap.add_layer(subscriber_layer)
smap

Because we were able to bring together historical weather data for the dates in our records, we wanted to explore the relationship these variables have with usertype status. If subscribers regularly use the bikes for commuting, as we've begun to see, then weather shouldn't impact their rental statistics as much as it does customers, who appear to be primarily opportunistic in their usage.
A cursory glance reveals a noticeable difference in bike rentals with respect to low temperatures, precipitation, and snowfall. While it is true that there are fewer customers than subscribers, we are concerned primarily with the spread or distribution of the plot points rather than their quantity. The customer pair plots show fewer points across the lower temperature ranges and the higher precipitation/snowfall ranges. The distributions for the two usertypes converge again at higher temperatures and lower precipitation.
If stations consistently see use during "bad" weather, then those stations could be identified as subscriber stations. Further, if certain customers are found making the same trips consistently in all weather types, then they could be pushed for subscription.
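This idea could be prototyped with a simple filter-and-count over the merged data. A sketch under assumed thresholds (the precipitation/temperature cutoffs and minimum trip count below are illustrative guesses, not values derived from our analysis):

```python
import pandas as pd

def flag_bad_weather_stations(df, prcp_thresh=0.5, tmin_thresh=32.0, min_trips=50):
    """Flag stations that still see meaningful use in 'bad' weather.

    The thresholds and the min_trips cutoff are illustrative assumptions.
    Expects the merged trip/weather frame with PRCP, TMIN, and
    start_station_name columns.
    """
    # Keep only trips taken in rain/snow or freezing temperatures
    bad = df[(df["PRCP"] >= prcp_thresh) | (df["TMIN"] <= tmin_thresh)]
    # Count bad-weather trips per start station
    counts = bad.groupby("start_station_name").size()
    # Stations with consistent bad-weather use are likely "subscriber stations"
    return counts[counts >= min_trips].sort_values(ascending=False)
```

Run against the full merged dataset, the returned stations would be the natural first candidates for the "subscriber station" label described above.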
Customer Users
%%time
sns.pairplot(CitiBikeDataSampled_5050.query("usertype == 'Customer'"), x_vars=["PRCP","SNOW","TAVE","TMAX","TMIN"], y_vars=["tripduration","tripdurationLog","LinearDistance"])
Subscriber Users
%%time
sns.pairplot(CitiBikeDataSampled_5050.query("usertype == 'Subscriber'"), x_vars=["PRCP","SNOW","TAVE","TMAX","TMIN"], y_vars=["tripduration","tripdurationLog","LinearDistance"])
Almost universally, across every day of the week, customers have a higher trip duration than subscribers. While additional analysis will be required to confirm this, one possible explanation is that subscribers can freely take and return bikes, making them more willing to make shorter trips, whereas customers pay each time they want to rent a bike in the first place. An alternate explanation, based on what we know about the relationship between trip duration and linear distance traveled, is that subscribers use the bikes to commute to and from specific locations. This would result in lower trip durations than for customers, who might use the bikes for general travel around the city. This possibility is corroborated by the decrease in subscriber activity on weekends.
Identifying the point at which a customer might become a subscriber would probably involve monitoring weekday activity and trip duration. If a station has many customers with trip durations similar to those of subscribers, that station would be a good location for focused advertisement of the benefits of subscribing.
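A rough sketch of how such candidate stations might be surfaced from the merged data (the tolerance and minimum-trip cutoffs are illustrative assumptions, not tuned values):

```python
import pandas as pd

def subscription_candidate_stations(df, tolerance=0.10, min_customer_trips=30):
    """Stations where weekday customer trip durations resemble subscribers'.

    `tolerance` (fraction of the subscriber median) and `min_customer_trips`
    are illustrative cutoffs. Expects start_station_name, usertype,
    DayOfWeek, and tripduration columns.
    """
    weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
    wk = df[df["DayOfWeek"].isin(weekdays)]
    # Median weekday trip duration per station, split by usertype
    med = wk.pivot_table(index="start_station_name", columns="usertype",
                         values="tripduration", aggfunc="median").dropna()
    # Only consider stations with enough customer traffic to matter
    n_cust = wk[wk["usertype"] == "Customer"].groupby("start_station_name").size()
    close = (med["Customer"] - med["Subscriber"]).abs() <= tolerance * med["Subscriber"]
    busy = med.index.isin(n_cust[n_cust >= min_customer_trips].index)
    return med[close & busy]
```

The stations returned would be the ones worth targeting with subscription messaging.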
%%time
sns.set(style="whitegrid", palette="pastel", color_codes=True)
plt.rcParams['figure.figsize'] = (12, 6)
# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="DayOfWeek", y="tripdurationLog", hue="usertype", data=CitiBikeDataSampled_5050, split=True,
               inner="quart", palette={"Subscriber": "g", "Customer": "y"})
sns.despine(left=True)
Unlike trip duration, the linear distance between start and end stations appears similar for customers and subscribers in both means and quartiles. What is noticeable here is that customers are more widely distributed in how far or near they ride, with a significant increase in the number of customers who return their bikes to the station they started from.
Further analysis will be necessary to explore the statistical significance of these differences, but it would be possible to identify the stations frequented by subscribers, treat stations within one standard deviation of the linear distances found below as "subscriber stations," and then target stations outside those zones with messaging encouraging subscription. Furthermore, by identifying those "hot zones," it would be possible to rotate out bikes to increase their longevity.
%%time
sns.set(style="whitegrid", palette="pastel", color_codes=True)
plt.rcParams['figure.figsize'] = (12, 6)
# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="DayOfWeek", y="LinearDistance", hue="usertype", data=CitiBikeDataSampled_5050, split=True,
               inner="quart", palette={"Subscriber": "g", "Customer": "y"})
sns.despine(left=True)
Now that we have the dataset sampled, some legwork is still necessary to convert our categorical attributes into integer values. Below we walk through this process for the following five attributes: start_station_name, end_station_name, gender, DayOfWeek, and TimeOfDay.
Once these 5 attributes have been encoded using OneHotEncoding, 79 new attribute columns are added to our dataset for analysis in our model.
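The column arithmetic works out because pd.get_dummies creates one column per retained category: 31 start stations + 33 end stations + 3 gender codes + 7 days of the week + 5 times of day = 79 new columns. A toy sketch of the mechanism (the values below are made up for illustration, not drawn from our data):

```python
import pandas as pd

# Small stand-in frame with two categorical attributes
toy = pd.DataFrame({"gender": [0, 1, 2, 1],
                    "TimeOfDay": ["Morning", "Night", "Midday", "Morning"]})

# One dummy column is produced per distinct value of each attribute
g = pd.get_dummies(toy.gender, prefix="gender")        # 3 distinct values -> 3 columns
t = pd.get_dummies(toy.TimeOfDay, prefix="TimeOfDay")  # 3 distinct values here -> 3 columns
toy = pd.concat((toy, g, t), axis=1)                   # add back into the frame

print(list(g.columns))  # ['gender_0', 'gender_1', 'gender_2']
```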
Start Station Name
Initially including all start (and end) station locations caused excessive system resource loading during later randomized principal component computations, freezing our personal workstations and eventually ending in Python 'MemoryError' exceptions. Therefore, due to the extremely large number of start stations in our dataset (330), we needed to reduce this dimension to a manageable size manually. Through trial and error on the highest-frequency stations, we chose to reduce the count to roughly 10% of the original. By identifying the top 20 start stations for Subscribers and Customers separately, we found 9 overlapping stations, producing a final list of 31 stations. When encoding our start_station_name integer columns, we limit the columns to these identified stations.
%%time
#How many Start Stations are there?
print(len(CitiBikeDataSampled_5050["start_station_name"].drop_duplicates()))
# Top 20 Start Stations for Subscriber Users
startstationsubfreq = pd.DataFrame({'count' : CitiBikeDataSampled_5050[CitiBikeDataSampled_5050["usertype"] == 'Subscriber'].groupby(["start_station_name"]).size()}).reset_index().sort_values('count',ascending = False)
TopSubStartStations = startstationsubfreq.head(20)
del startstationsubfreq
# Top 20 Start Stations for Customer Users
startstationcustfreq = pd.DataFrame({'count' : CitiBikeDataSampled_5050[CitiBikeDataSampled_5050["usertype"] == 'Customer'].groupby(["start_station_name"]).size()}).reset_index().sort_values('count',ascending = False)
TopCustStartStations = startstationcustfreq.head(20)
del startstationcustfreq
#Concat Subscribers and Customers
TopStartStations = pd.DataFrame(pd.concat([TopSubStartStations,TopCustStartStations])["start_station_name"].drop_duplicates()).reset_index()
print(len(TopStartStations))
display(TopStartStations[["start_station_name"]])
del TopStartStations
del TopSubStartStations
del TopCustStartStations
#Split Start Station Values for 50/50 dataset
AttSplit = pd.get_dummies(CitiBikeDataSampled_5050.start_station_name,prefix='start_station_name')
CitiBikeDataSampled_5050 = pd.concat((CitiBikeDataSampled_5050,AttSplit[["start_station_name_Pershing Square N", "start_station_name_E 17 St & Broadway", "start_station_name_8 Ave & W 31 St", "start_station_name_Lafayette St & E 8 St", "start_station_name_W 21 St & 6 Ave", "start_station_name_8 Ave & W 33 St", "start_station_name_W 20 St & 11 Ave", "start_station_name_Broadway & E 14 St", "start_station_name_Broadway & E 22 St", "start_station_name_W 41 St & 8 Ave", "start_station_name_Cleveland Pl & Spring St", "start_station_name_University Pl & E 14 St", "start_station_name_West St & Chambers St", "start_station_name_E 43 St & Vanderbilt Ave", "start_station_name_Broadway & W 24 St", "start_station_name_Greenwich Ave & 8 Ave", "start_station_name_W 18 St & 6 Ave", "start_station_name_Broadway & W 60 St", "start_station_name_Pershing Square S", "start_station_name_W 33 St & 7 Ave", "start_station_name_Central Park S & 6 Ave", "start_station_name_Centre St & Chambers St", "start_station_name_Grand Army Plaza & Central Park S", "start_station_name_Vesey Pl & River Terrace", "start_station_name_Broadway & W 58 St", "start_station_name_West Thames St", "start_station_name_12 Ave & W 40 St", "start_station_name_9 Ave & W 14 St", "start_station_name_W 14 St & The High Line", "start_station_name_State St", "start_station_name_Broadway & Battery Pl"]]),axis=1) # add back into the dataframe
del AttSplit
End Station Name
Similarly, we have an extremely large number of end stations in our dataset (330), and including all of them caused system crashes during principal component analysis later in our lab. We were required to reduce this dimension to a manageable size. Through trial and error on the highest-frequency stations, we chose to reduce the count to roughly 10% of the original. By identifying the top 20 end stations for Subscribers and Customers separately, we found 7 overlapping stations, producing a final list of 33 stations. When encoding our end_station_name integer columns, we limit the columns to these identified stations.
%%time
#How many End Stations are there?
print(len(CitiBikeDataSampled_5050["end_station_name"].drop_duplicates()))
# Top 20 End Stations for Subscriber Users
endstationsubfreq = pd.DataFrame({'count' : CitiBikeDataSampled_5050[CitiBikeDataSampled_5050["usertype"] == 'Subscriber'].groupby(["end_station_name"]).size()}).reset_index().sort_values('count',ascending = False)
TopSubendStations = endstationsubfreq.head(20)
del endstationsubfreq
# Top 20 End Stations for Customer Users
endstationcustfreq = pd.DataFrame({'count' : CitiBikeDataSampled_5050[CitiBikeDataSampled_5050["usertype"] == 'Customer'].groupby(["end_station_name"]).size()}).reset_index().sort_values('count',ascending = False)
TopCustendStations = endstationcustfreq.head(20)
del endstationcustfreq
#Concat Subscribers and Customers
TopendStations = pd.DataFrame(pd.concat([TopSubendStations,TopCustendStations])["end_station_name"].drop_duplicates()).reset_index()
print(len(TopendStations))
display(TopendStations[["end_station_name"]])
del TopendStations
del TopSubendStations
del TopCustendStations
#Split End Station Values for 50/50 dataset
AttSplit = pd.get_dummies(CitiBikeDataSampled_5050.end_station_name,prefix='end_station_name')
CitiBikeDataSampled_5050 = pd.concat((CitiBikeDataSampled_5050,AttSplit[["end_station_name_E 17 St & Broadway", "end_station_name_Lafayette St & E 8 St", "end_station_name_8 Ave & W 31 St", "end_station_name_W 21 St & 6 Ave", "end_station_name_Pershing Square N", "end_station_name_W 20 St & 11 Ave", "end_station_name_Broadway & E 14 St", "end_station_name_Broadway & E 22 St", "end_station_name_University Pl & E 14 St", "end_station_name_W 41 St & 8 Ave", "end_station_name_West St & Chambers St", "end_station_name_Cleveland Pl & Spring St", "end_station_name_Greenwich Ave & 8 Ave", "end_station_name_E 43 St & Vanderbilt Ave", "end_station_name_Broadway & W 24 St", "end_station_name_W 18 St & 6 Ave", "end_station_name_MacDougal St & Prince St", "end_station_name_Carmine St & 6 Ave", "end_station_name_8 Ave & W 33 St", "end_station_name_2 Ave & E 31 St", "end_station_name_Central Park S & 6 Ave", "end_station_name_Centre St & Chambers St", "end_station_name_Grand Army Plaza & Central Park S", "end_station_name_Broadway & W 60 St", "end_station_name_Broadway & W 58 St", "end_station_name_12 Ave & W 40 St", "end_station_name_Vesey Pl & River Terrace", "end_station_name_W 14 St & The High Line", "end_station_name_9 Ave & W 14 St", "end_station_name_West Thames St", "end_station_name_State St", "end_station_name_Old Fulton St", "end_station_name_South End Ave & Liberty St"]]),axis=1) # add back into the dataframe
del AttSplit
Gender, DayOfWeek, and TimeOfDay
The rest of our encoding attributes {gender, DayOfWeek, and TimeOfDay} each take a small, fixed set of values (three gender codes, the seven days of the week, and five times of day). Each distinct value will be encoded as an individual integer column as well.
%%time
#Split gender Values for 50/50 dataset
AttSplit = pd.get_dummies(CitiBikeDataSampled_5050.gender,prefix='gender')
CitiBikeDataSampled_5050 = pd.concat((CitiBikeDataSampled_5050,AttSplit),axis=1) # add back into the dataframe
del AttSplit
#Split DayOfWeek Values for 50/50 dataset
AttSplit = pd.get_dummies(CitiBikeDataSampled_5050.DayOfWeek,prefix='DayOfWeek')
CitiBikeDataSampled_5050 = pd.concat((CitiBikeDataSampled_5050,AttSplit),axis=1) # add back into the dataframe
del AttSplit
#Split TimeOfDay Values for 50/50 dataset
AttSplit = pd.get_dummies(CitiBikeDataSampled_5050.TimeOfDay,prefix='TimeOfDay')
CitiBikeDataSampled_5050 = pd.concat((CitiBikeDataSampled_5050,AttSplit),axis=1) # add back into the dataframe
del AttSplit
display(CitiBikeDataSampled_5050.head())
With these encodings complete, our final dataset for cross-validation on test/train splits would appear to be complete. However, given the large number of attributes now present in our dataset, it would be wise to investigate dimensionality reduction, not only to speed up model generation but also to improve accuracy by removing variable redundancy and correlation.
At this stage, we've converted our original 30 variables into 107 attributes after creating dummy variables for categorical data such as day of the week, time of day, station names, gender, etc. These 107 attributes and their data types are as follows:
%%time
data_type = []
for idx, col in enumerate(CitiBikeDataSampled_5050.columns):
    data_type.append(CitiBikeDataSampled_5050.dtypes[idx])
summary_df = {'Attribute Name' : pd.Series(CitiBikeDataSampled_5050.columns, index = range(len(CitiBikeDataSampled_5050.columns))), 'Data Type' : pd.Series(data_type, index = range(len(CitiBikeDataSampled_5050.columns)))}
summary_df = pd.DataFrame(summary_df)
display(summary_df)
del data_type, summary_df
Because our stratified, processed data comprises 105 attributes, ranging from weather and distance data to location and user type data, across all 500,000 sample observations, and some variables (such as weather attributes and even some start and end stations) correlate with one another, we felt it wise during our previous Lab 2 analysis to reduce the number of attributes considered during Customer/Subscriber prediction. We used Principal Component Analysis (PCA) to reduce the dimensionality of our dataset.
Furthermore, during analysis of our principal components' loadings, we identified only 22 of the originally processed 105 attributes as contextually important for identifying Customers and Subscribers. As the intent of our Lab 3 analysis is to further identify Customer users who should be Subscribers based on their behaviour, we deem it wise to use only these 22 attributes for clustering as well.
These attributes are listed in the subsetting code below.
In addition to using only these attributes while clustering, we chose to split our stratified sample data set of 500,000 transactions into separate Customer and Subscriber data sets while clustering. This will provide us the advantage of identifying clusters based on Customer data separately from Subscriber data - the advantage being that further granularity will be offered into the user sub-groups (based on transaction details) that comprise each user class. When later comparing these clusterings between each user class, we will be able to further classify one user type's data against the opposite user type's cluster IDs. This, and its implementation, will be described in much greater detail in the Deployment section. Currently, it suffices to say that clustering against each known user type is a necessary means of identifying would-be Subscribers.
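The loadings inspection from Lab 2 can be sketched roughly as follows. This is a minimal illustration (the function name, component count, and top-k cutoff are ours for demonstration), not the exact Lab 2 code:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def top_loading_attributes(X, n_components=10, top_k=5):
    """Return the attributes with the largest absolute loadings per component.

    X is a numeric DataFrame; n_components and top_k are illustrative choices.
    """
    scaled = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components).fit(scaled)
    # Rows of components_ are the loadings of each principal component
    loadings = pd.DataFrame(pca.components_, columns=X.columns)
    picks = set()
    for _, row in loadings.iterrows():
        # Keep the attributes that dominate this component
        picks.update(row.abs().nlargest(top_k).index)
    return sorted(picks)
```

Applying this style of inspection across the retained components is how a shortlist like our 22 attributes can be assembled from a much wider attribute set.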
%%time
# Subset data set to only variables identified to have the greatest PCA loadings
attr_clus = CitiBikeDataSampled_5050[['start_station_latitude',
'start_station_longitude',
'end_station_latitude',
'end_station_longitude',
'PRCP',
'SNOW',
'TAVE',
'TMAX',
'TMIN',
'DayOfWeek_Friday',
'DayOfWeek_Monday',
'DayOfWeek_Saturday',
'DayOfWeek_Sunday',
'DayOfWeek_Thursday',
'DayOfWeek_Tuesday',
'DayOfWeek_Wednesday',
'TimeOfDay_Afternoon',
'TimeOfDay_Evening',
'TimeOfDay_Midday',
'TimeOfDay_Morning',
'TimeOfDay_Night',
'tripdurationLog',
'usertype']]
attr_scaled = attr_clus.iloc[:, 0:(len(attr_clus.columns)-1)] #Remove usertype from scaled columns
scaler = StandardScaler().fit(attr_scaled)
CitiBike_clus = scaler.transform(attr_scaled)
CitiBike_clus = pd.DataFrame(CitiBike_clus)
users = pd.DataFrame(CitiBikeDataSampled_5050.usertype)
CitiBike_clus = pd.concat([CitiBike_clus.reset_index(), users.reset_index()], axis = 1)
del CitiBike_clus['index']
CitiBike_clus.columns = attr_clus.columns
CitiBike_C = CitiBike_clus.loc[CitiBike_clus['usertype'] == 'Customer']
CitiBike_S = CitiBike_clus.loc[CitiBike_clus['usertype'] == 'Subscriber']
#min_max_scaler = MinMaxScaler()
#scaled = min_max_scaler.fit_transform(CitiBike_C.ix[:,0:(len(CitiBike_C.columns)-1)])
#cols = CitiBike_C.ix[:,0:(len(CitiBike_C.columns)-1)].columns
#CitiBike_C = pd.DataFrame(scaled, columns=cols)
#
#min_max_scaler = MinMaxScaler()
#scaled = min_max_scaler.fit_transform(CitiBike_S.ix[:,0:(len(CitiBike_S.columns)-1)])
#cols = CitiBike_S.ix[:,0:(len(CitiBike_S.columns)-1)].columns
#CitiBike_S = pd.DataFrame(scaled, columns=cols)
#
print('Customer data dimensions =', CitiBike_C.shape)
print('Subscriber data dimensions =',CitiBike_S.shape)
In addition to separating the data based on user type, we also scaled the data values to remove bias while clustering due to differing attribute scales. Without scaling, attributes such as station coordinates and trip duration would carry heavier weights than the OneHotEncoded attributes and precipitation data, causing unbalanced and improperly clustered groups. The first 5 standardized ride transactions are shown below for Customers and Subscribers as an example of what the data looks like after scaling.
display(CitiBike_C.head())
display(CitiBike_S.head())
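As a quick sanity check on the standardization step, a small synthetic example (the column scales below are invented to mimic a latitude, a trip duration, and a one-hot flag) confirms that StandardScaler yields mean ~0 and standard deviation ~1 per column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
raw = np.column_stack([
    rng.normal(40.7, 0.05, 1000),   # latitude-like scale
    rng.normal(600, 300, 1000),     # trip-duration-like scale
    rng.integers(0, 2, 1000),       # one-hot-like column
])
scaled = StandardScaler().fit_transform(raw)

# Each column is now centered and unit-variance, so no attribute dominates
print(scaled.mean(axis=0).round(6))  # ~[0, 0, 0]
print(scaled.std(axis=0).round(6))   # ~[1, 1, 1]
```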
With our attributes identified and our datasets segregated into customers and subscribers, we began with our first clustering algorithm: KMeans. With 22 attributes and 250,000 observations in each dataset, KMeans stands to be a potentially good fit. Our first step is identifying the number of clusters to use. Because our dataset isn't entirely quantitative, we based our decision on silhouette scores rather than pure accuracy scores. Unfortunately, computing silhouette scores is resource-intensive, so we were forced to sample each dataset to a point where we could consistently run the silhouette graphing process. We sampled 25,000 observations (about 10%) with replacement.
With a sample size decided upon, we ran the dataset through multiple iterations of silhouette analysis using an incremental series of cluster sizes from 2 to 30 (going up to 70 with the subscriber dataset). Each iteration used a different seed, letting us pull a range of random samples and see which cluster sizes would generally yield the best results. Below are the series with the best silhouette scores alongside their silhouette graphs.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
kmeans_S = CitiBike_S.sample(n=25000, replace=True, random_state=10101)
kmeans_y = kmeans_S['usertype']
kmeans_X = kmeans_S[["start_station_latitude","start_station_longitude","end_station_latitude","end_station_longitude","PRCP","SNOW","TAVE","TMAX","TMIN","DayOfWeek_Friday","DayOfWeek_Monday","DayOfWeek_Saturday","DayOfWeek_Sunday","DayOfWeek_Thursday","DayOfWeek_Tuesday","DayOfWeek_Wednesday","TimeOfDay_Afternoon","TimeOfDay_Evening","TimeOfDay_Midday","TimeOfDay_Morning","TimeOfDay_Night","tripdurationLog"]]
#For testing and finding number of clusters
#range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
#range_n_clusters = [51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70]
#Did see equivalent averages at around n clusters 25 to 30, but that was due to extreme spikes due to small cluster size
range_n_clusters = [9, 10, 11, 12, 13]
for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    # The 1st subplot is the silhouette plot. The silhouette coefficient can
    # range from -1 to 1, but here all values lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(kmeans_X) + (n_clusters + 1) * 10])
    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(kmeans_X)
    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(kmeans_X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(kmeans_X, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)  # cm.spectral was renamed nipy_spectral in newer matplotlib
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples
    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(kmeans_X["start_station_longitude"], kmeans_X["start_station_latitude"],
                marker='.', s=30, lw=0, alpha=0.7, c=colors)
    # Labeling the clusters. Column 0 of the centers is latitude and column 1
    # is longitude, so the order is swapped to match the lon/lat axes above.
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 1], centers[:, 0],
                marker='o', c="white", alpha=1, s=200)
    for i, c in enumerate(centers):
        ax2.scatter(c[1], c[0], marker='$%d$' % i, alpha=1, s=50)
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Start station longitude (scaled)")
    ax2.set_ylabel("Start station latitude (scaled)")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
For n_clusters = 9 The average silhouette_score is : 0.199781516478
For n_clusters = 10 The average silhouette_score is : 0.196131086326
For n_clusters = 11 The average silhouette_score is : 0.203819623072
For n_clusters = 12 The average silhouette_score is : 0.201432376304
For n_clusters = 13 The average silhouette_score is : 0.187475393763

The subscriber dataset consistently had at least one cluster containing a negative silhouette score range. Further analysis would be necessary to identify why this collection of observations scores so low, so we instead selected a cluster count with a high average silhouette score and as small a negative range as possible. Above are the top five cluster sizes by this criterion. Ultimately, we decided to go with 11 clusters.
It should be noted that we also explored cluster sizes above 50. While we eventually saw an average score above 0.24, this was mainly due to narrow clusters with high score peaks. Above a cluster size of 15, even when the average score improved, multiple clusters had negative ranges with extreme peaks.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
kmeans_C = CitiBike_C.sample(n=25000, replace=True, random_state=10101)
kmeans_y = kmeans_C['usertype']
kmeans_X = kmeans_C[["start_station_latitude","start_station_longitude","end_station_latitude","end_station_longitude","PRCP","SNOW","TAVE","TMAX","TMIN","DayOfWeek_Friday","DayOfWeek_Monday","DayOfWeek_Saturday","DayOfWeek_Sunday","DayOfWeek_Thursday","DayOfWeek_Tuesday","DayOfWeek_Wednesday","TimeOfDay_Afternoon","TimeOfDay_Evening","TimeOfDay_Midday","TimeOfDay_Morning","TimeOfDay_Night","tripdurationLog"]]
#For testing and finding number of clusters
#range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
#Did see equivalent averages at around n clusters 25 to 30, but that was due to extreme spikes due to small cluster size
range_n_clusters = [6, 7, 8, 9, 10]
for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    # The 1st subplot is the silhouette plot. The silhouette coefficient can
    # range from -1 to 1, but here all values lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(kmeans_X) + (n_clusters + 1) * 10])
    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(kmeans_X)
    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(kmeans_X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(kmeans_X, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)  # cm.spectral was renamed nipy_spectral in newer matplotlib
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples
    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(kmeans_X["start_station_longitude"], kmeans_X["start_station_latitude"],
                marker='.', s=30, lw=0, alpha=0.7, c=colors)
    # Labeling the clusters. Column 0 of the centers is latitude and column 1
    # is longitude, so the order is swapped to match the lon/lat axes above.
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 1], centers[:, 0],
                marker='o', c="white", alpha=1, s=200)
    for i, c in enumerate(centers):
        ax2.scatter(c[1], c[0], marker='$%d$' % i, alpha=1, s=50)
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Start station longitude (scaled)")
    ax2.set_ylabel("Start station latitude (scaled)")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
For n_clusters = 6 The average silhouette_score is : 0.196749844481
For n_clusters = 7 The average silhouette_score is : 0.221545306167
For n_clusters = 8 The average silhouette_score is : 0.177520274085
For n_clusters = 9 The average silhouette_score is : 0.183981689115
For n_clusters = 10 The average silhouette_score is : 0.179497780886

Again, as with the subscriber dataset, we looked at a range of cluster sizes for customers. What was perhaps surprising is that, despite the same features, fewer clusters were needed to represent this data. As with subscribers, higher cluster counts did produce higher silhouette scores, but at the cost of less uniform cluster sizes, higher peaks, and more negative ranges. Ultimately we settled on 7 clusters for our customer dataset.
Our next approach to clustering the CitiBike user data is to implement DBSCAN on the 22 attributes selected for clustering. However, initial clustering attempts that included longitude and latitude data failed with memory errors. For this reason, we have chosen to remove start and end station coordinate data from the attributes identified for clustering.
While this differs from our other approaches to clustering in this lab, treating coordinates this way is a necessary concession given the processing problems encountered on our workstations. Our theory is that the trouble stems in part from having so many transactions with exactly the same geo-coordinate values. One solution might be to OneHotEncode coordinate values, treating them as categorical data; the downside is that doing so would significantly increase the number of attributes in our data set. Given the problem and scope of this analysis, we've opted for their removal instead.
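To make the rejected alternative concrete, here is a minimal sketch of one-hot encoding coordinates with pandas get_dummies (the coordinate values are made up; the real data has roughly 330 distinct stations, so encoding start and end coordinates would add hundreds of columns):

```python
import pandas as pd

# Hypothetical coordinates: 4 trips spanning 3 distinct start stations
trips = pd.DataFrame({
    'start_station_latitude':  [40.767, 40.767, 40.719, 40.735],
    'start_station_longitude': [-73.994, -73.994, -74.003, -73.991],
})

# Treat each distinct coordinate value as a categorical level
encoded = pd.get_dummies(trips, columns=['start_station_latitude',
                                         'start_station_longitude'])
print(encoded.shape)  # (4, 6): 3 unique latitudes + 3 unique longitudes
```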
# .ix is deprecated; use .iloc for positional column selection
C_dbscan = CitiBike_C.iloc[:,4:22]
S_dbscan = CitiBike_S.iloc[:,4:22]
CitiBike_loc = CitiBike_clus[['start_station_latitude','start_station_longitude']].drop_duplicates()
print('Customer dbscan data dimensions =', C_dbscan.shape)
print('Subscriber dbscan data dimensions =', S_dbscan.shape)
Coordinate removal aside, we do have some reservations about implementing DBSCAN clustering at all, since the density method is based on Euclidean distance measures. Even with 18 attributes instead of 22, the high dimensionality of our Customer and Subscriber data sets is expected to limit the effectiveness of distance as the primary measure. Nevertheless, the attempt is still worthwhile in order to compare the results against our other clustering methods.
Of course, another expected limitation is that of size. DBSCAN is not very efficient with our large data set since a full pairwise similarity matrix must be constructed, amounting to $n^2$ computations each time the algorithm is run (see Sci-Kit Learn's description). So DBSCAN is expected to be slow to run, making parameter adjustments difficult and time-consuming.
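A back-of-the-envelope estimate shows why the quadratic cost matters at our scale; the float64 entry size and dense in-memory matrix are our assumptions for illustration, not a statement about scikit-learn's exact internals:

```python
# Memory needed for a dense pairwise distance matrix over the
# 250,000-row customer sample, at 8 bytes per float64 entry
n = 250_000
bytes_needed = n * n * 8
print(bytes_needed / 1e9)  # 500.0 GB, far beyond our workstations
```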
Moving on with the analysis, we will later plot the cluster IDs against the coordinates in order to gain better insight into which cluster IDs appear at each station (again, coordinate attributes are removed from the DBSCAN cluster data; plotting against coordinates is for marketing application purposes only, as will be described in the deployment section). The initial coordinates, without clustering, are depicted in the following scatterplot.
%%time
plt.figure(figsize=(15,5))
plt.plot(CitiBike_loc.start_station_longitude, CitiBike_loc.start_station_latitude,
         'bo', markersize=4, markerfacecolor='b') # plot the station locations
plt.title('Latitude/Longitude Data')
plt.xlabel('Scaled Longitude Coordinate')
plt.ylabel('Scaled Latitude Coordinate')
plt.xlim(-10,10)
plt.ylim(-3,2.5)
plt.ticklabel_format(style='plain', axis='x')
plt.grid()
plt.ticklabel_format(useOffset=False)
plt.show()
As mentioned previously, customer and subscriber data will be clustered separately in order to help identify customers that should be subscribers but aren't, as will be discussed in greater detail in the Deployment section.
Below are functions that will be utilized throughout the DBSCAN clustering process. The getGraph function simply generates a k-neighbors graph of the data in order to plot potential eps values based on a prescribed minimum number of samples to be used in the DBSCAN algorithm. Next, the epsPlot function plots these potential eps values. Finally, our dbs function clusters the data based on prescribed eps and minimum-number-of-samples parameters. Because we plan to plot the clusters against station locations, we adopted and modified a couple of functions from a Stackoverflow post to jitter our data points when depicting a scatterplot. These functions are provided below as well.
%%time
def rand_jitter(arr):
    stdev = arr.max()/100.
    return arr + np.random.randn(len(arr)) * stdev

def jitter(x, y, s=20, c='b', marker='o', cmap=None, norm=None, vmin=None,
           vmax=None, alpha=None, linewidths=None, **kwargs):
    # Pass the optional arguments through to scatter rather than hardcoding None
    return plt.scatter(rand_jitter(x), rand_jitter(y), s=s, c=c, marker=marker,
                       cmap=cmap, norm=norm, vmin=vmin, vmax=vmax, alpha=alpha,
                       linewidths=linewidths, **kwargs)

def getGraph(N, data):
    # Calculate distance to the N nearest neighbors
    graph = kneighbors_graph(data, n_neighbors=N, mode='distance')
    return graph

def epsPlot(N, graph):
    N1 = graph.shape[0]
    CitiBike_distances = np.zeros((N1,1))
    for i in range(N1):
        CitiBike_distances[i] = graph[i,:].max()
    CitiBike_distances = np.sort(CitiBike_distances, axis=0)
    plt.figure(figsize=(15,5))
    plt.plot(range(N1), CitiBike_distances, 'r.', markersize=4) # plot the sorted distances
    plt.title('Dataset name: CitiBike_clus, sorted by neighbor distance')
    plt.xlabel('CitiBike_clus, Instance Number')
    plt.ylabel('CitiBike_clus, Distance to {0}th nearest neighbor'.format(N))

def dbs(eps, minpts, data):
    db = DBSCAN(eps=eps, min_samples=minpts, n_jobs=-1).fit(data)
    labels = db.labels_
    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    # Mark the samples that are considered "core"
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    plt.figure(figsize=(15,4))
    unique_labels = set(labels) # the unique labels
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    for k, col in zip(unique_labels, colors):
        if k == -1:
            col = 'k' # black used for noise
        class_member_mask = (labels == k)
        if data.equals(C_dbscan):
            xy = CitiBike_C[class_member_mask & core_samples_mask]
        elif data.equals(S_dbscan):
            xy = CitiBike_S[class_member_mask & core_samples_mask]
        # Plot the core points in this class
        jitter(xy.start_station_longitude, xy.start_station_latitude, c=col, s=10)
        # Plot the remaining points that are edge points
        if data.equals(C_dbscan):
            xy = CitiBike_C[class_member_mask & ~core_samples_mask]
        elif data.equals(S_dbscan):
            xy = CitiBike_S[class_member_mask & ~core_samples_mask]
        jitter(xy.start_station_longitude, xy.start_station_latitude, c=col, s=10)
    plt.title('Estimated number of clusters: %d' % n_clusters_)
    plt.xlim(-10,10)
    plt.ylim(-3,2.5)
    plt.grid()
    plt.ticklabel_format(useOffset=False)
    plt.show()
    return db
Before running the DBSCAN algorithm on our customer data set, we would first like to obtain a basic understanding of what types of eps and minimum number of samples to use. Due to the multi-dimensionality of our data, and the fact that we have 250,000 observations in our customer data set, it is difficult to define an appropriate minimum number of samples through visual inspection of the data. Therefore, we have chosen a preliminary sample count of 200 from which we will identify an eps value to use as our initial starting point.
The eps plot below is configured such that customer transaction instances are ordered from least to greatest distance to their 200th nearest neighbor. To determine a good starting point, we identify a point where the distance starts to increase (looking for an inflection point upward). This is a bit challenging given the plotted results, but, again, this is only a starting point for our permutations. The arguable point identified in this case is at an eps value of about 0.5. This will be our starting value used in our first customer DBSCAN permutation.
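Our visual knee-picking can also be roughed out programmatically; the heuristic below is our own crude sketch on a synthetic curve, simply taking the distance just before the largest jump in the sorted k-distance values:

```python
import numpy as np

def knee_eps(sorted_distances):
    """Crude elbow finder: the distance just before the largest jump."""
    jumps = np.diff(sorted_distances)
    return float(sorted_distances[int(np.argmax(jumps))])

# Synthetic sorted k-distance curve: a long flat stretch, then a sharp rise
curve = np.array([0.1]*50 + [0.2]*30 + [0.5]*15 + [3.0, 3.5, 4.0])
print(knee_eps(curve))  # 0.5, the value just before the upturn
```

On a noisy real curve like ours, such a heuristic only suggests a starting point, which is how we treated the visually chosen 0.5 as well.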
%%time
N = 200
CitiBike_knn_graph = getGraph(N, C_dbscan)
epsPlot(N, CitiBike_knn_graph)
plt.annotate('Expected Eps value = approx. 0.5', xy=(225000, 0.7), xytext=(150000, 10),
arrowprops=dict(facecolor='black', shrink=0.05),)
plt.show()
As mentioned previously, a minimum core sample count of 200 was selected rather arbitrarily since it is very difficult to identify a good starting point visually. Given that our customer data contains 250,000 transactions and we'd like DBSCAN results producing no more than ten to twenty clusters at most, 200 seemed like a decent starting point alongside the right eps value. After a few DBSCAN runs, we adjusted this number based on the results produced when adjusting the eps value.
DBSCAN's inefficiencies with large data sets quickly became apparent as we performed the permutations below. Their outputs are pickled in order to save time each time our analysis code is run through again; the concept of pickling our results will be discussed further in the Deployment section.
Cluster visualizations will be provided and discussed in the Modeling and Evaluation Part 3 section later in this writeup and, therefore, are not provided immediately below.
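For reference, the pickleObject and unpickleObject helpers used below were defined earlier in our notebook; a hedged reconstruction consistent with how they are called here (their exact signatures are assumptions) looks like:

```python
import os
import pickle

def pickleObject(obj, name, filepath="PickleFiles/"):
    """Serialize a fitted model so expensive runs need not be repeated."""
    os.makedirs(filepath, exist_ok=True)
    with open(os.path.join(filepath, name + ".pkl"), "wb") as f:
        pickle.dump(obj, f)

def unpickleObject(name, filepath="PickleFiles/"):
    """Read a previously pickled object back into memory."""
    with open(os.path.join(filepath, name + ".pkl"), "rb") as f:
        return pickle.load(f)
```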
%%time
if os.path.isfile("PickleFiles/db_c.pkl"):
    print("Found the File!")
    db_c = unpickleObject("db_c")
else: db_c = dbs(0.5, 200, C_dbscan)
%%time
if os.path.isfile("PickleFiles/db_c1.pkl"):
    print("Found the File!")
    db_c1 = unpickleObject("db_c1")
else: db_c1 = dbs(0.7, 200, C_dbscan)
%%time
if os.path.isfile("PickleFiles/db_c2.pkl"):
    print("Found the File!")
    db_c2 = unpickleObject("db_c2")
else: db_c2 = dbs(0.9, 200, C_dbscan)
%%time
if os.path.isfile("PickleFiles/db_c3.pkl"):
    print("Found the File!")
    db_c3 = unpickleObject("db_c3")
else: db_c3 = dbs(1.0, 200, C_dbscan)
%%time
if os.path.isfile("PickleFiles/db_c4.pkl"):
    print("Found the File!")
    db_c4 = unpickleObject("db_c4")
else: db_c4 = dbs(1.0, 250, C_dbscan)
%%time
if os.path.isfile("PickleFiles/db_c5.pkl"):
    print("Found the File!")
    db_c5 = unpickleObject("db_c5")
else: db_c5 = dbs(1.0, 300, C_dbscan)
%%time
if os.path.isfile("PickleFiles/db_c6.pkl"):
    print("Found the File!")
    db_c6 = unpickleObject("db_c6")
else: db_c6 = dbs(1.2, 200, C_dbscan)
%%time
if os.path.isfile("PickleFiles/db_c7.pkl"):
    print("Found the File!")
    db_c7 = unpickleObject("db_c7")
else: db_c7 = dbs(2.0, 200, C_dbscan)
%%time
if os.path.isfile("PickleFiles/db_c.pkl"):
    print("Found the File!")
else: pickleObject(db_c, "db_c", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_c1.pkl"):
    print("Found the File!")
else: pickleObject(db_c1, "db_c1", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_c2.pkl"):
    print("Found the File!")
else: pickleObject(db_c2, "db_c2", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_c3.pkl"):
    print("Found the File!")
else: pickleObject(db_c3, "db_c3", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_c4.pkl"):
    print("Found the File!")
else: pickleObject(db_c4, "db_c4", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_c5.pkl"):
    print("Found the File!")
else: pickleObject(db_c5, "db_c5", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_c6.pkl"):
    print("Found the File!")
else: pickleObject(db_c6, "db_c6", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_c7.pkl"):
    print("Found the File!")
else: pickleObject(db_c7, "db_c7", filepath = "PickleFiles/")
Similar to what was done for our customer data above, we again chose a minimum core sample count of 200 as a preliminary starting point for subscriber data, not knowing exactly how many samples to define given our 18-attribute DBSCAN data set. We fully expect the results of the following eps plot to be different from the customer data's plot because, as has been clearly identified in our previous labs and visually earlier in this lab, customers and subscribers behave differently. Therefore we expect them to cluster differently as well.
The subscriber eps plot below is configured as was the plot for customer data above. The arguable point identified in this case is at an eps value of about 1.0. This will be our starting value used in our first subscriber DBSCAN permutation.
%%time
N = 200
CitiBike_knn_graph = getGraph(N, S_dbscan)
epsPlot(N, CitiBike_knn_graph)
plt.annotate('Expected Eps value = approx. 1.0', xy=(225000, 1.0), xytext=(150000, 5.5),
arrowprops=dict(facecolor='black', shrink=0.05),)
plt.show()
With our starting points identified, we again planned to conduct 8 different permutations to narrow down to the best subscriber DBSCAN configuration. These configurations' parameters are outlined below along with their final run times (Note that subscriber permutations were run on a different workstation than the customer permutations to make efficient use of work time. So, these times are not meant to be an apples-to-apples comparison against customer times.):
After reviewing results for these eight permutations, we decided it was worth also attempting two more permutations as outlined below:
Again, these permutations are performed below but their outputs are pickled in order to save time with each run of our code. Cluster visualizations will be provided and discussed in the Modeling and Evaluation Part 3 section later in this writeup and, therefore, are not provided immediately below.
%%time
if os.path.isfile("PickleFiles/db_s.pkl"):
    print("Found the File!")
    db_s = unpickleObject("db_s")
else: db_s = dbs(1.0, 200, S_dbscan)
%%time
if os.path.isfile("PickleFiles/db_s1.pkl"):
    print("Found the File!")
    db_s1 = unpickleObject("db_s1")
else: db_s1 = dbs(0.8, 200, S_dbscan)
%%time
if os.path.isfile("PickleFiles/db_s2.pkl"):
    print("Found the File!")
    db_s2 = unpickleObject("db_s2")
else: db_s2 = dbs(0.5, 200, S_dbscan)
%%time
if os.path.isfile("PickleFiles/db_s3.pkl"):
    print("Found the File!")
    db_s3 = unpickleObject("db_s3")
else: db_s3 = dbs(0.5, 150, S_dbscan)
%%time
if os.path.isfile("PickleFiles/db_s4.pkl"):
    print("Found the File!")
    db_s4 = unpickleObject("db_s4")
else: db_s4 = dbs(0.5, 250, S_dbscan)
%%time
if os.path.isfile("PickleFiles/db_s5.pkl"):
    print("Found the File!")
    db_s5 = unpickleObject("db_s5")
else: db_s5 = dbs(0.5, 300, S_dbscan)
%%time
if os.path.isfile("PickleFiles/db_s6.pkl"):
    print("Found the File!")
    db_s6 = unpickleObject("db_s6")
else: db_s6 = dbs(0.5, 350, S_dbscan)
%%time
if os.path.isfile("PickleFiles/db_s7.pkl"):
    print("Found the File!")
    db_s7 = unpickleObject("db_s7")
else: db_s7 = dbs(0.5, 450, S_dbscan)
%%time
if os.path.isfile("PickleFiles/db_s8.pkl"):
    print("Found the File!")
    db_s8 = unpickleObject("db_s8")
else: db_s8 = dbs(1, 450, S_dbscan) ## This is our WINNER
%%time
if os.path.isfile("PickleFiles/db_s9.pkl"):
    print("Found the File!")
    db_s9 = unpickleObject("db_s9")
else: db_s9 = dbs(1.2, 450, S_dbscan)
%%time
if os.path.isfile("PickleFiles/db_s.pkl"):
    print("Found the File!")
else: pickleObject(db_s, "db_s", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_s1.pkl"):
    print("Found the File!")
else: pickleObject(db_s1, "db_s1", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_s2.pkl"):
    print("Found the File!")
else: pickleObject(db_s2, "db_s2", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_s3.pkl"):
    print("Found the File!")
else: pickleObject(db_s3, "db_s3", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_s4.pkl"):
    print("Found the File!")
else: pickleObject(db_s4, "db_s4", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_s5.pkl"):
    print("Found the File!")
else: pickleObject(db_s5, "db_s5", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_s6.pkl"):
    print("Found the File!")
else: pickleObject(db_s6, "db_s6", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_s7.pkl"):
    print("Found the File!")
else: pickleObject(db_s7, "db_s7", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_s8.pkl"):
    print("Found the File!")
else: pickleObject(db_s8, "db_s8", filepath = "PickleFiles/")
if os.path.isfile("PickleFiles/db_s9.pkl"):
    print("Found the File!")
else: pickleObject(db_s9, "db_s9", filepath = "PickleFiles/")
The final clustering method we will apply to our data set is Spectral Clustering. Due to the multi-dimensional state of our data, our expectation is that spectral clustering will outperform K-means and DBSCAN clustering as it does not rely on distance measures directly but rather node connectivity. Another expected advantage to using Spectral Clustering is that it allows us to reduce the dimensionality of our data by clustering the data's first few Laplacian matrix eigenvectors only instead of all 22 attributes. As such, it does require a bit more data prep before clustering such as computing Similarity and Affinity, Degree, and Laplacian matrices, but the anticipated outcome is worth the effort. It is by calculating these Laplacian matrix eigenvectors that we are also able to identify an appropriate number of clusters to feed into the algorithm. Cluster count identification will be revealed and discussed in the Modeling and Evaluation Part 2 - Evaluate and Compare section.
For Spectral Clustering we chose to use R not only for a change of pace but also because documentation for running the algorithm in R is plentiful and application using R overlapped with one of our group members' projects at work. A similar outcome could have been reached with Python just as well, but integrating R offered a fun, new challenge on our Windows OS machines.
Before computing our required matrices, we must first initialize our rpy2.ipython session and pass it our scaled Customer and Subscriber data. We also check to ensure the data imported correctly as an R dataframe and that attribute data types are as expected.
%load_ext rpy2.ipython
#%R install.packages("kernlab")
%R library(kernlab)
clear_display()
%%time
%R -i CitiBike_C
%R -i CitiBike_S
%%time
out = %R capture.output(str(CitiBike_C))
for line in out:
    print(line)
print("")
out = %R capture.output(str(CitiBike_S))
for line in out:
    print(line)
After initial attempts to compute similarity matrices for all 250,000 transactions from each user type data set, it quickly became apparent we would need to sub-sample our data further. The relationship between sample size and the CPU time required to compute the Similarity, Affinity, Degree, and Laplacian matrices is not linear but quadratic, as expected from the pairwise computations involved. For example, when random-sampling 500 transactions from each data set, matrix computations took 11 minutes per data set (22 minutes in total between both data sets). With 1000 samples, the required CPU runtime increased to 44 minutes per data set (1.5 hours in total); doubling the sample size quadrupled the work, exactly as $O(n^2)$ growth predicts. Given the significant amount of time already spent running our DBSCAN code above, we opted to sample no more than 1000 transactions based on our project schedule. We later address this loss in granularity by predicting the remaining transactions' cluster IDs via classification methods in our Deployment section.
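The scaling observation above can be captured as a tiny extrapolation helper (our own sketch, anchored to the 500-sample timing we measured):

```python
# Pairwise similarity computation is O(n^2): doubling the sample
# from 500 to 1000 should quadruple the runtime, matching what we saw
def est_minutes(n, base_n=500, base_minutes=11):
    return base_minutes * (n / base_n) ** 2

print(est_minutes(1000))    # 44.0 minutes, matching our observation
print(est_minutes(250000))  # 2750000.0 minutes: why the full set is infeasible
```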
%%time
%%R
set.seed(100)
CitiBike_miniC <- CitiBike_C[sample(1:nrow(CitiBike_C), 1000, replace=FALSE),]
CitiBike_miniS <- CitiBike_S[sample(1:nrow(CitiBike_S), 1000, replace=FALSE),]
In final preparation for matrix computation, we configure our Similarity and Affinity matrix functions which are a bit more extensive than our Degree and Laplacian matrix computations. These functions were adapted from a tutorial written by Joao Neto in December 2013. Similarly, we based our eigenvector derivations on this same tutorial.
%%time
%%R
# Similarity matrix computation functions
s <- function(x1, x2, alpha=1) {
  exp(- alpha * norm(as.matrix(x1-x2), type="F"))
}

make.similarity <- function(my.data, similarity) {
  N <- nrow(my.data)
  S <- matrix(rep(NA,N^2), ncol=N)
  for(i in 1:N) {
    for(j in 1:N) {
      S[i,j] <- similarity(my.data[i,], my.data[j,])
    }
  }
  S
}

# Affinity matrix computation function
make.affinity <- function(S, n.neighbors=2) {
  N <- length(S[,1])
  if (n.neighbors >= N) { # fully connected
    A <- S
  } else {
    A <- matrix(rep(0,N^2), ncol=N)
    for(i in 1:N) { # for each row
      # only connect to those points with larger similarity
      best.similarities <- sort(S[i,], decreasing=TRUE)[1:n.neighbors]
      for (s in best.similarities) {
        j <- which(S[i,] == s)
        A[i,j] <- S[i,j]
        A[j,i] <- S[i,j] # to make an undirected graph, i.e., the matrix becomes symmetric
      }
    }
  }
  A
}
We first tackle the Customer data set by calculating our Similarity, Affinity, Degree, and Laplacian matrices. Due to matrix computation CPU requirements (42min 43s), we save our Laplacian matrix object as an RDS file for readback, similar to our previous pickles.
The Similarity and Affinity matrices serve a purpose similar to an Adjacency matrix, indicating node proximity with edges denoting links. Once the Affinity matrix is produced, it is used to compute the Degree matrix, a diagonal matrix summing the edge weights for each node. Next, the Laplacian matrix is computed from both the Degree and Affinity matrices. We opted to implement the unnormalized graph Laplacian for our application since our data is already scaled. The formula is simply: $Laplacian = Degree - Affinity$
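For readers more comfortable in Python, the same Degree and unnormalized Laplacian computation looks like this in NumPy (the tiny affinity matrix is a toy example, not our CitiBike data):

```python
import numpy as np

# Toy symmetric affinity matrix for a 3-node graph
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
D = np.diag(A.sum(axis=1))  # Degree matrix: row sums on the diagonal
L = D - A                   # Laplacian = Degree - Affinity

print(L.sum(axis=1))  # every row sums to zero, a defining Laplacian property
```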
%%time
%%R
set.seed(100)
if(!file.exists('PickleFiles/CustLap.rds')){
  S <- make.similarity(CitiBike_miniC[,1:22], s)
  A <- make.affinity(S, 3) # use 3 neighbors (includes self)
  D <- diag(apply(A, 1, sum)) # sum rows
  U <- D - A
  saveRDS(U, 'PickleFiles/CustLap.rds')
} else {
  U <- readRDS('PickleFiles/CustLap.rds')
}
After the final Laplacian matrix is complete, we move on to eigenvector computation. Specifically, we are interested in deriving the first eigenvector and the second eigenvector, known as the Fiedler Vector. These are required for us to cluster our data in a reduced dimensional space by identifying the location of a normalized cut in the Fiedler Vector. This cut location is revealed by analyzing both the number of edges in each cluster that do not connect to other clusters and the number of edges which do connect the clusters. Similarly, relative interconnectivity and relative closeness may be used to evaluate a proper cut: $RI \cdot RC^{\alpha}$
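The Fiedler vector's role in cutting a graph can be seen on a toy example (our own illustration, not the CitiBike matrices): two tightly linked node pairs joined by one weak edge, where the sign pattern of the second-smallest eigenvector recovers the natural two-way partition.

```python
import numpy as np

# 4-node graph: strong edges 0-1 and 2-3, one weak bridge between 0 and 2
A = np.array([[0. , 1. , 0.1, 0. ],
              [1. , 0. , 0. , 0. ],
              [0.1, 0. , 0. , 1. ],
              [0. , 0. , 1. , 0. ]])
D = np.diag(A.sum(axis=1))
L = D - A

vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
fiedler = vecs[:, 1]            # eigenvector of the second-smallest eigenvalue
partition = fiedler > 0         # sign split: {0,1} on one side, {2,3} on the other
print(partition)
```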
%%time
%%R
set.seed(100)
k <- 2
evL <- eigen(U, symmetric=TRUE)
Z <- evL$vectors[,(ncol(evL$vectors)-k+1):ncol(evL$vectors)]
evLc <- evL
The above matrix and eigenvalue computations are the raw, manual methods used to transform the data set into a reduced dimensional space which can be easily clustered. Doing so manually is essential to identifying the correct number of clusters to feed into our clustering algorithm. This cluster count identification will be discussed in detail in the Modeling and Evaluation Part 2 - Evaluate and Compare section, but suffice it to say for now that the appropriate number of clusters for customer data is 3.
Fortunately, besides computing eigenvectors for cluster count identification ourselves, we can rely on the specc function from R's kernlab library to perform these computations and then cluster on the produced eigenvectors. All that is required of us beforehand is to convert our 1000-sample dataframe into a matrix.
%%time
%%R
set.seed(100)
mCitiC <- data.matrix(CitiBike_miniC)
colnames(mCitiC) <- colnames(CitiBike_miniC)
The specc Spectral Clustering function provides a number of parameters we can adjust to obtain varying cluster results. Permutations are provided below and run quickly enough that no further RDS object storage is required.
Centers defines the number of clusters to produce as one would expect. Iterations in this case defines the maximum number of iterations allowed for the clustering algorithm to perform; the default number of iterations is 200. The kernel argument defines the kernel function used to compute the affinity matrix within the specc call; the default is 'rbfdot' which is the Radial Basis, or Gaussian, kernel function. Other options provided in our configurations include 'laplacedot' and 'polydot', where 'laplacedot' is the Laplacian kernel function and 'polydot' is the Polynomial kernel function. Finally, nystrom.red is a boolean argument signifying whether to use Nystrom's Method to calculate eigenvectors. The default is False, but when changed to True, a sample of the dataset is used to compute the eigenvalues so that only an $n*m$ matrix ($n$ being the sample size) is stored in memory. We use the default number of Nystrom's Method samples which is $\frac{nrow}{6}$, or 167 given our data.
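For completeness, scikit-learn's SpectralClustering offers a rough Python analogue of specc; the sketch below uses synthetic data and the default rbf affinity (paralleling rbfdot), and is an approximation, not a reproduction of our R configuration:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for our 1000-row sample matrix: 3 separated blobs
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.4, random_state=100)

# n_clusters parallels specc's centers argument; affinity='rbf'
# parallels its default rbfdot kernel
sc = SpectralClustering(n_clusters=3, affinity='rbf', random_state=100)
labels = sc.fit_predict(X)
print(len(set(labels)))  # 3 clusters recovered
```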
%%time
%%R
set.seed(100)
sc_c <- specc(mCitiC, centers = 3)
%%time
%%R
set.seed(100)
sc_c1 <- specc(mCitiC, centers = 3, iterations = 500)
%%time
%%R
set.seed(100)
sc_c2 <- specc(mCitiC, centers = 3, iterations = 1000)
%%time
%%R
set.seed(100)
sc_c3 <- specc(mCitiC, centers = 3, iterations = 1000, kernel = 'laplacedot')
%%time
%%R
set.seed(100)
sc_c4 <- specc(mCitiC, centers = 3, iterations = 1000, kernel = 'polydot')
%%time
%%R
set.seed(100)
sc_c5 <- specc(mCitiC, centers = 3, iterations = 1000, kernel = 'rbfdot', nystrom.red = TRUE)
We next tackle the Subscriber data set by calculating our Similarity, Affinity, Degree, and Laplacian matrices. Due to matrix computation CPU requirements (48min 29s), we save our Laplacian matrix object as an RDS file for readback, as was done with Customer spectral clustering.
%%time
%%R
set.seed(100)
if(!file.exists('PickleFiles/SubsLap.rds')){
  S <- make.similarity(CitiBike_miniS[,1:22], s)
  A <- make.affinity(S, 3) # use 3 neighbors (includes self)
  D <- diag(apply(A, 1, sum)) # sum rows
  U <- D - A
  saveRDS(U, 'PickleFiles/SubsLap.rds')
} else {
  U <- readRDS('PickleFiles/SubsLap.rds')
}
After the final Laplacian matrix is complete, we move on to eigenvector computation to obtain our first eigenvector and Fiedler Vector.
%%time
%%R
set.seed(100)
k <- 2
evL <- eigen(U, symmetric=TRUE)
Z <- evL$vectors[,(ncol(evL$vectors)-k+1):ncol(evL$vectors)]
evLs <- evL
signif(evL$values,2) # eigenvalues are in decreasing order
The above matrix and eigenvalue computations are the raw, manual methods used to transform the data set into a reduced dimensional space which can be easily clustered. Doing so manually is essential to identifying the correct number of clusters to feed into our clustering algorithm. This cluster count identification will be discussed in detail in the Modeling and Evaluation Part 2 - Evaluate and Compare section, but suffice it to say for now that the appropriate number of clusters to start with for subscriber data is 4. Five clusters were also identified as a possible candidate and will be explored here as well.
Again, the specc function will re-perform the matrix computations for us and then cluster on the produced eigenvectors. All that is required of us beforehand is to convert our 1000-sample dataframe into a matrix.
%%time
%%R
set.seed(100)
mCitiS <- data.matrix(CitiBike_miniS)
colnames(mCitiS) <- colnames(CitiBike_miniS)
Subscriber data permutations are provided below and run quickly enough that no further RDS object storage is required. Notice that 4 centers were defined at first, but since the cluster sizes were not well distributed, 5 clusters were used instead (we had some flexibility to use either 4 or 5 centers based on the cluster count analysis, as will be described later).
The centers, iterations, kernel, and nystrom.red arguments behave exactly as described for the customer permutations above.
%%time
%%R
set.seed(1000)
sc_s <- specc(mCitiS, centers = 4)
%%time
%%R
set.seed(1000)
sc_s6 <- specc(mCitiS, centers = 5)
%%time
%%R
set.seed(1000)
sc_s1 <- specc(mCitiS, centers = 5, iterations = 500)
%%time
%%R
set.seed(1000)
sc_s2 <- specc(mCitiS, centers = 5, iterations = 1000)
%%time
%%R
set.seed(1000)
sc_s3 <- specc(mCitiS, centers = 5, iterations = 1000, kernel = 'laplacedot')
%%time
%%R
set.seed(1000)
sc_s4 <- specc(mCitiS, centers = 5, iterations = 1000, kernel = 'polydot')
%%time
%%R
set.seed(1000)
sc_s5 <- specc(mCitiS, centers = 5, iterations = 1000, kernel = 'rbfdot', nystrom.red = TRUE)
One of the biggest pitfalls of K-means is that it is easy to land in local optima that do not represent the greater dataset. An example for each dataset is below, using the inertia results for different initialization counts:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.utils import shuffle
from sklearn.utils import check_random_state
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import KMeans
kmeans_sub_y = CitiBike_S['usertype']
kmeans_sub_X = CitiBike_S[["start_station_latitude","start_station_longitude","end_station_latitude","end_station_longitude","PRCP","SNOW","TAVE","TMAX","TMIN","DayOfWeek_Friday","DayOfWeek_Monday","DayOfWeek_Saturday","DayOfWeek_Sunday","DayOfWeek_Thursday","DayOfWeek_Tuesday","DayOfWeek_Wednesday","TimeOfDay_Afternoon","TimeOfDay_Evening","TimeOfDay_Midday","TimeOfDay_Morning","TimeOfDay_Night","tripdurationLog"]]
kmeans_cus_y = CitiBike_C['usertype']
kmeans_cus_X = CitiBike_C[["start_station_latitude","start_station_longitude","end_station_latitude","end_station_longitude","PRCP","SNOW","TAVE","TMAX","TMIN","DayOfWeek_Friday","DayOfWeek_Monday","DayOfWeek_Saturday","DayOfWeek_Sunday","DayOfWeek_Thursday","DayOfWeek_Tuesday","DayOfWeek_Wednesday","TimeOfDay_Afternoon","TimeOfDay_Evening","TimeOfDay_Midday","TimeOfDay_Morning","TimeOfDay_Night","tripdurationLog"]]
n_runs = 5
n_init_range = np.array([1, 5, 10, 15, 20])
fig = plt.figure()
plots = []
legends = []
cases = [
(KMeans, 'k-means++', {}),
(KMeans, 'random', {})
]
for factory, init, params in cases:
    inertia = np.empty((len(n_init_range), n_runs))
    for run_id in range(n_runs):
        for i, n_init in enumerate(n_init_range):
            km = factory(n_clusters=11, init=init, random_state=run_id,
                         n_init=n_init, **params).fit(kmeans_sub_X)
            inertia[i, run_id] = km.inertia_
    p = plt.errorbar(n_init_range, inertia.mean(axis=1), inertia.std(axis=1))
    plots.append(p[0])
    legends.append("%s with %s init" % (factory.__name__, init))
plt.xlabel('n_init')
plt.ylabel('inertia')
plt.legend(plots, legends)
plt.title("Subscribers: Mean inertia for various k-means init across %d runs" % n_runs)

fig = plt.figure() # start a new figure for the customer plot
plots = []
legends = []
for factory, init, params in cases:
    inertia = np.empty((len(n_init_range), n_runs))
    for run_id in range(n_runs):
        for i, n_init in enumerate(n_init_range):
            km = factory(n_clusters=11, init=init, random_state=run_id,
                         n_init=n_init, **params).fit(kmeans_cus_X)
            inertia[i, run_id] = km.inertia_
    p = plt.errorbar(n_init_range, inertia.mean(axis=1), inertia.std(axis=1))
    plots.append(p[0])
    legends.append("%s with %s init" % (factory.__name__, init))
plt.xlabel('n_init')
plt.ylabel('inertia')
plt.legend(plots, legends)
plt.title("Customers: Mean inertia for various k-means init across %d runs" % n_runs)

The inertia score drops quickly by 5 initializations but continues to decline out to 20. Though not evident here, we also extended the range to 50 initializations and found the inertia still decreasing. Ultimately we settled on roughly 10 initializations for the customer set and 15 for the subscriber set.
K-means is a demanding clustering algorithm in that it requires a guided approach: you must first decide how many clusters to prepare, and ultimately you need a good idea of what your data set should look like once it has been prepared for modeling.
Our data set is complex and primarily qualitative, with over 100 attributes, of which only 22 were identified via PCA in Lab 2 and carried into this clustering analysis. Moreover, we are attempting a classification into one of two categories, and while our approach of splitting the data set into two groups and measuring how confidently new observations fit the observed clusters is novel, it does not mesh well with some of the assumptions behind K-means. Admittedly, K-means did not require additional random sampling beyond our already sampled data set; it was, however, difficult to evaluate because of our data's lack of uniformity.
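Because K-means needs the cluster count chosen upfront, one common aid (not part of our original analysis, shown here only as a hedged sketch on synthetic stand-in data) is to compare candidate counts with the silhouette score and keep the count that scores highest:

```python
# Sketch: comparing candidate k values for K-means via the silhouette score.
# The data here are synthetic stand-ins, not the Citi Bike features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
# Three well-separated toy blobs in 3 dimensions.
X = np.vstack([rng.normal(loc, 0.2, size=(50, 3)) for loc in (0.0, 2.0, 4.0)])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher = tighter, better-separated

best_k = max(scores, key=scores.get)
```

On data this cleanly separated the score peaks at the true blob count; on our real, less uniform data the peak is far less pronounced, which is part of why K-means was hard to evaluate here.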
As alluded to previously, DBSCAN proved an inefficient algorithm for our application. We wanted very much to like it but could not compel ourselves to do so. Since the number of clusters is determined by the density parameters eps and minimum core samples, we initially aimed simply to generate a reasonable number of customer and subscriber clusters. Unfortunately, this became our primary metric for DBSCAN evaluation due to ridiculous computation times given our workstation limitations, which in turn stressed our project deliverables. Since we also knew the "curse of dimensionality" was working against valid DBSCAN results on our 18-attribute data sets (remember, coordinates were removed), we deemed it a better use of our time to move on to Spectral Clustering.
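To make the role of eps and minimum core samples concrete, here is a hedged sketch on synthetic data (not the Citi Bike features): DBSCAN labels noise points -1, and cluster IDs start at 0, so `labels.max() + 1` gives the cluster count we tabulate in the permutation tables below.

```python
# Sketch: how eps and min_samples control DBSCAN's cluster and outlier counts.
# Synthetic stand-in data: two dense blobs plus a few scattered points.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(60, 2)),   # dense blob 1
               rng.normal(3.0, 0.1, size=(60, 2)),   # dense blob 2
               rng.uniform(-2, 5, size=(5, 2))])     # sparse scatter

labels = DBSCAN(eps=0.3, min_samples=5).fit(X).labels_
n_clusters = labels.max() + 1          # IDs 0..max, so max+1 clusters
n_outliers = int((labels == -1).sum()) # noise points are labeled -1
```

Shrinking eps or raising min_samples fragments the blobs into more, smaller clusters and more noise, which is exactly the eps/core-sample tuning behavior we wrestled with at far greater cost on 500,000 rows.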
cluster_id_customer = pd.DataFrame({"Cluster_ID_Customer": db_c7.labels_, "Cluster_ID_Subscriber": np.NaN})
Cust_clus_DB = pd.concat([CitiBike_C.reset_index(), cluster_id_customer], axis=1)
#CitiBike_C = pd.concat([CitiBike_C, cluster_id_customer], axis=1)
Portrayed in the table below are the cluster counts for each permutation of DBSCAN on our customer data. Our directed approach to adjusting eps and core samples moved in the right direction, with the first iteration producing a whopping 103 clusters and the last producing 49 (at significant expense in computation time). Based on the number of clusters produced and our prior knowledge of customer ride behaviors, 49 clusters is still unreasonably large. Given the $n^2$ computations, the time required to complete them, and the observable increase in runtime as fewer clusters were produced, we chose to stop here.
clus_counts = [db_c.labels_.max() + 1,
               db_c1.labels_.max() + 1,
               db_c2.labels_.max() + 1,
               db_c3.labels_.max() + 1,
               db_c4.labels_.max() + 1,
               db_c5.labels_.max() + 1,
               db_c6.labels_.max() + 1,
               db_c7.labels_.max() + 1]
clus_counts = pd.DataFrame({'Cluster_Counts' : clus_counts})
clus_counts.index.names = ['Permutation']
clus_counts.T
Cluster size distributions for the final customer permutation are depicted in the table and histogram below. It is interesting to see that even with so many clusters, there is still significant imbalance in cluster size from one grouping to the next. Given our expectations for this data set, this further proves DBSCAN insufficient for our needs. On a positive note, there were only 2,570 outliers in our customer data (average transactions per cluster were roughly 5,000 when including outliers).
Agg = pd.DataFrame({'count' : Cust_clus_DB.groupby(["Cluster_ID_Customer"]).size()}).reset_index()
display(Agg.T)
print("Min. transactions per cluster (including outliers) =", Agg['count'].min())
print("Max. transactions per cluster (including outliers) =", Agg['count'].max())
print("Average transactions per cluster (including outliers) =", Agg['count'].mean())
Agg.hist('count', color = 'orange', bins = 20)
cluster_id_subscriber = pd.DataFrame({"Cluster_ID_Customer": np.NaN, "Cluster_ID_Subscriber": db_s8.labels_})
Sub_clus_DB = pd.concat([CitiBike_S.reset_index(), cluster_id_subscriber], axis=1)
#CitiBike_C = pd.concat([CitiBike_C, cluster_id_customer], axis=1)
#Sub_clus_DB.tail(10)
Portrayed in the table below are the cluster counts for each permutation of DBSCAN on our subscriber data. This time our adjustments to eps and core samples did not start in the best direction: the first permutation produced 71 clusters, and subsequent permutations increased the count. Though our eps and minimum core samples parameters were eventually corrected, we ran a couple of additional permutations in an attempt to drive the cluster count down further (our smallest being 36). Based on the number of clusters produced and our prior knowledge of subscriber ride behaviors, even 36 clusters is still unreasonably large. Given the $n^2$ computations, the time required to complete them, and the observable increase in runtime as fewer clusters were produced, we chose to stop here.
clus_counts = [db_s.labels_.max() + 1,
               db_s1.labels_.max() + 1,
               db_s2.labels_.max() + 1,
               db_s3.labels_.max() + 1,
               db_s4.labels_.max() + 1,
               db_s5.labels_.max() + 1,
               db_s6.labels_.max() + 1,
               db_s7.labels_.max() + 1,
               db_s8.labels_.max() + 1,
               db_s9.labels_.max() + 1]
clus_counts = pd.DataFrame({'Cluster_Counts' : clus_counts})
clus_counts.index.names = ['Permutation']
clus_counts.T
Cluster size distributions for the permutation producing 36 clusters are depicted in the table and histogram below. It is interesting to see that even with so many clusters, there is still significant imbalance in cluster size from one grouping to the next. The disparity between sizes is quite large for subscriber DBSCAN results as well. For example, the 36th cluster only contains 465 transactions whereas the 5th cluster contains 13,392 and there are a total of 40,669 outliers! While this is the "best" subscriber configuration, it still falls short of our needs.
Agg = pd.DataFrame({'count' : Sub_clus_DB.groupby(["Cluster_ID_Subscriber"]).size()}).reset_index()
display(Agg.T)
print("Min. transactions per cluster (including outliers) =", Agg['count'].min())
print("Max. transactions per cluster (including outliers) =", Agg['count'].max())
print("Average transactions per cluster (including outliers) =", Agg['count'].mean())
Agg.hist('count', color = 'violet', bins = 20)
Since we are trying to identify underlying traits exhibited within customer and subscriber classes across 22 attributes, we went into our spectral clustering methodology with high hopes. Fortunately, our results did not disappoint. By producing the Laplacian matrix and its associated eigenvectors, we were able to cluster data within a spectral space rather than the feature space.
Unlike DBSCAN and more like K-means, we were required to identify upfront an appropriate number of clusters to have the algorithm produce. Unlike K-means, however, this was done by analyzing the eigenvalue spectrum for customer and subscriber data sets and locating the point at which the largest gap occurs between eigenvalues for each cluster count. This will be described in further detail in the respective customer/subscriber sections that follow.
Before implementing our Spectral Clustering permutations in the Modeling and Evaluation Part 1 - Train and adjust parameters section above, we calculated Similarity, Affinity, Degree, and Laplacian matrices manually. We also calculated Laplacian matrix eigenvalues at that time. This preparation was all meant to lead up to the eigenvalue spectrum plot displayed below.
In this plot, cluster counts are displayed on the x-axis with Laplacian eigenvalues on the y-axis. The dotted red line marks the largest gap between consecutive eigenvalues; the cluster count just below that gap is generally the one selected. We therefore identified 3 clusters as the optimum count for our customer data, and 3 is what we entered into our Spectral Clustering permutations earlier. More information regarding our approach is available in Ulrike von Luxburg's paper, A Tutorial on Spectral Clustering.
%%time
%%R
plot(1:10, rev(evLc$values)[1:10], xlab = "Customer Clusters", ylab = "Laplacian Eigenvalues", main = "Customer Eigenvalue Spectrum")
abline(v=3.5, col="red", lty=2) # largest eigenvalue gap between 3 and 4 clusters
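The eigengap heuristic we apply in the R plot above (pick k where the gap between consecutive Laplacian eigenvalues is largest) can be sketched in plain NumPy. This is a hedged illustration on a toy similarity matrix with an obvious 3-block structure, not our actual affinity matrix:

```python
# Sketch of the eigengap heuristic on a toy block-structured similarity matrix.
import numpy as np

rng = np.random.RandomState(0)
n, k_true = 30, 3
# Similarity: strong within each of 3 blocks, weak (0.01) between blocks.
S = np.full((n, n), 0.01)
for b in range(k_true):
    idx = slice(b * n // k_true, (b + 1) * n // k_true)
    S[idx, idx] = 1.0
np.fill_diagonal(S, 0.0)

D = np.diag(S.sum(axis=1))          # degree matrix
L = D - S                           # unnormalized graph Laplacian
eigvals = np.sort(np.linalg.eigvalsh(L))

gaps = np.diff(eigvals[:10])        # gaps between consecutive small eigenvalues
k_est = int(np.argmax(gaps)) + 1    # count just below the largest gap
```

With three near-disconnected blocks, the first three eigenvalues sit near zero and the fourth jumps, so the largest gap falls after the third eigenvalue and the heuristic recovers k = 3.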
Before jumping straight to our permutations, however, we did take interest in comparing within sum of squares while varying cluster counts. Below is a plot depicting the within sum of squares with each number of clusters. Note the only parameter that is being changed in our function call is the number of clusters.
As can be seen, while 3 clusters does not produce the smallest within sum of squares, its Wss is the third smallest. This seems a good compromise given the behavior of the eigenvalues above.
%%time
%%R
set.seed(100)
wss <- 0
wss[1] <- ((length(mCitiC)/length(colnames(mCitiC)))-1)*sum(apply(mCitiC,2,var))
for (i in 2:22) wss[i] <- sum(withinss(specc(mCitiC, centers = i)))
plot(1:22, wss, type="b", xlab="Number of Customer Clusters", ylab="Within groups sum of squares", main = "Customer WSS", col = 'blue')
%%time
%%R
wss <- c(sum(withinss(sc_c)),
         sum(withinss(sc_c1)),
         sum(withinss(sc_c2)),
         sum(withinss(sc_c3)),
         sum(withinss(sc_c4)),
         sum(withinss(sc_c5)))
siz <- c(paste(size(sc_c), sep = '', collapse = ', '),
         paste(size(sc_c1), sep = '', collapse = ', '),
         paste(size(sc_c2), sep = '', collapse = ', '),
         paste(size(sc_c3), sep = '', collapse = ', '),
         paste(size(sc_c4), sep = '', collapse = ', '),
         paste(size(sc_c5), sep = '', collapse = ', '))
its <- c(200,500,1000,1000,1000,1000)
cen <- c(3,3,3,3,3,3)
ker <- c("rbfdot", "rbfdot", "rbfdot", "laplacedot", "polydot", "rbfdot")
nys <- c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
Cust_SC <- data.frame(cen, its, ker, nys, wss, siz)
Cust_SC <- setNames(Cust_SC, c("Cluster.Count", "Iterations", "Kernel", "Nystrom.Method", "wss", "Sizes"))
The theme of using within sum of squares continues as we now review the outcome of each tried configuration. Surprisingly, there was no difference in Wss or cluster size distributions as displayed in the table below (ordered by smallest to largest Wss). The only deviation from a Wss value of 24,927 was when choosing to implement the Nystrom Method (Wss = 22,842). Again, the Nystrom Method further samples the data when calculating the Laplacian eigenvectors. Interestingly enough, this approach not only produced the smallest Wss but also produced the most evenly distributed cluster sizes.
%%time
%R Cust_SC[order(wss),]
Within sum of squares for each permutation is again depicted below in line-chart form. Notice all Wss values are the same except that of the 6th permutation, which utilizes the Nystrom Method for eigenvector calculation.
Cust_SC = %R Cust_SC
fig = plt.figure()
plt.plot(Cust_SC.index.values, Cust_SC.wss, 'ro-', linewidth=2)
plt.title('Total Within Sum of Squares')
plt.xlabel('Configuration')
plt.ylabel('Total Wss')
plt.annotate('Selected Configuration', xy=(6, 22841), xytext=(4.2, 22550),
arrowprops=dict(facecolor='blue', shrink=0.05),)
plt.show()
Another note about the Nystrom Method is its speed. While running our algorithm without Nystrom's Method enabled, our runtimes averaged about 30 seconds. Our algorithm with Nystrom's Method enabled averaged between only 400ms and 600ms. This is approximately a 98.3% reduction in runtime while still producing the best results! So 3 clusters, 1000 iterations, the Radial Basis kernel, and Nystrom Method enabled takes the clear win among customer Spectral Clustering permutations.
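The speedup comes from the Nystrom Method approximating the full kernel matrix from a subsample of landmark points rather than computing all $n^2$ entries. Our analysis used R's kernlab; the closest Python analogue (a hedged sketch on toy clustered data, not our pipeline) is scikit-learn's `Nystroem` transformer:

```python
# Sketch: Nystroem approximates the full RBF kernel matrix from a subsample
# of landmark points, trading a little accuracy for a large speedup.
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
# Toy clustered data standing in for the scaled trip features.
X = np.vstack([rng.normal(c, 0.3, size=(100, 5)) for c in (0.0, 2.0, 4.0)])

K_exact = rbf_kernel(X, gamma=0.1)           # full 300 x 300 kernel matrix
ny = Nystroem(kernel='rbf', gamma=0.1, n_components=100, random_state=0)
feat = ny.fit_transform(X)                   # 300 x 100 approximate feature map
K_approx = feat @ feat.T                     # low-rank kernel approximation

rel_err = np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact)
```

Because a kernel matrix over well-clustered data is effectively low rank, 100 landmarks reconstruct it closely, which is why the approximation cost us little accuracy while cutting runtime dramatically.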
Similar to customer spectral clustering, we had calculated Similarity, Affinity, Degree, and Laplacian matrices manually in the Modeling and Evaluation Part 1 - Train and adjust parameters section along with the Laplacian matrix eigenvalues. This, too, led up to the eigenvalue spectrum plot displayed as follows.
Subscriber Laplacian eigenvalues were a bit different from customer eigenvalues in that substantial gaps exist between both 4 and 5 clusters and 5 and 6 clusters. This means that either the first 5 or the first 6 eigenvectors contain all the cluster information needed to describe the subscriber data. For this reason we tried both counts in our subscriber permutations.
%%time
%%R
plot(1:10, rev(evLs$values)[1:10], xlab = "Subscriber Clusters", ylab = "Laplacian Eigenvalues", log = "y", main = "Subscriber Eigenvalue Spectrum")
abline(v=4.5, col="red", lty=2) # largest eigenvalue gap between 4 and 5 clusters
abline(v=5.5, col="blue", lty=3) # Potential gap between 5 and 6 clusters too
Again, we wanted to review within sum of squares while varying cluster counts before proceeding with trying our different parameters. Below is a plot depicting the within sum of squares with each number of clusters. Note the only parameter that is being changed in our function call is the number of clusters.
As can be seen, both 4 and 5 clusters exhibit relatively small Wss values, though still larger than those for 2 or 3 clusters. It is also interesting that Wss dips between 3 and 4 clusters before increasing again at 5. While this relationship between Wss and cluster count is interesting to review, our real strategy with Spectral Clustering is to select the cluster count based on the Laplacian eigenvalues.
%%time
%%R
set.seed(100)
wss <- 0
wss[1] <- ((length(mCitiS)/length(colnames(mCitiS)))-1)*sum(apply(mCitiS,2,var))
for (i in 2:22) wss[i] <- sum(withinss(specc(mCitiS, centers = i)))
plot(1:22, wss, type="b", xlab="Number of Subscriber Clusters", ylab="Within groups sum of squares", main = "Subscriber WSS", col = 'red')
%%time
%%R
wss <- c(sum(withinss(sc_s)),
         sum(withinss(sc_s6)),
         sum(withinss(sc_s1)),
         sum(withinss(sc_s2)),
         sum(withinss(sc_s3)),
         sum(withinss(sc_s4)),
         sum(withinss(sc_s5)))
siz <- c(paste(size(sc_s), sep = '', collapse = ', '),
         paste(size(sc_s6), sep = '', collapse = ', '),
         paste(size(sc_s1), sep = '', collapse = ', '),
         paste(size(sc_s2), sep = '', collapse = ', '),
         paste(size(sc_s3), sep = '', collapse = ', '),
         paste(size(sc_s4), sep = '', collapse = ', '),
         paste(size(sc_s5), sep = '', collapse = ', '))
its <- c(200,200,500,1000,1000,1000,1000)
cen <- c(4,5,5,5,5,5,5)
ker <- c("rbfdot", "rbfdot", "rbfdot", "rbfdot", "laplacedot", "polydot", "rbfdot")
nys <- c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
Sub_SC <- data.frame(cen, its, ker, nys, wss, siz)
Sub_SC <- setNames(Sub_SC, c("Cluster.Count", "Iterations", "Kernel", "Nystrom.Method", "wss", "Sizes"))
Again, with an exception made for the Nystrom Method configuration, there was no difference in Wss or cluster size distributions as displayed in the table below for 5 clusters. There was, however, an expected difference between 4 and 5 cluster Wss values. What is unexpected is that the 4 cluster Wss value is larger than the 5 cluster value. This may be due to the randomized sampling method utilized by the R specc library. Regardless, the Nystrom Method still outperforms all other configurations in both Wss value and cluster size distributions.
%%time
%R Sub_SC[order(wss),]
Within sum of squares for each permutation is again depicted below in line-chart form. Nystrom's Method clearly takes the win for smallest Wss value.
Sub_SC = %R Sub_SC
fig = plt.figure()
plt.plot(Sub_SC.index.values, Sub_SC.wss, 'ro-', linewidth=2)
plt.title('Total Within Sum of Squares')
plt.xlabel('Configuration')
plt.ylabel('Total Wss')
plt.annotate('Selected Configuration', xy=(7, 26926), xytext=(5.2, 26550),
arrowprops=dict(facecolor='blue', shrink=0.05),)
plt.show()
Again worth noting is the Nystrom Method's speed. While running our algorithm without Nystrom's Method enabled, our runtimes averaged about 30 seconds. Our algorithm with Nystrom's Method enabled averaged between only 400ms and 600ms. This is approximately a 98.3% reduction in runtime while still producing the best results! So 5 clusters, 1000 iterations, the Radial Basis kernel, and Nystrom Method enabled takes the clear win among subscriber Spectral Clustering permutations.
Based on the evaluations above, we have identified Spectral Clustering with 3 clusters, 1000 iterations, the Radial Basis kernel, and the Nystrom Method enabled as the clear winner for customer data, and 5 clusters, 1000 iterations, the Radial Basis kernel, and the Nystrom Method enabled as the clear winner for subscriber data. These configurations produce the smallest within sum of squares in the shortest amount of time. While our other clustering methods suffer from the "curse of dimensionality," the effects of high dimensionality in the feature space are significantly mitigated in the spectral space. All this is accomplished while producing a small, manageable number of evenly sized clusters.
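For readers working in Python rather than R, a hedged sketch of the equivalent fit (scikit-learn's `SpectralClustering` with an RBF affinity, shown on toy blobs rather than our Citi Bike features; kernlab's specc and its Nystrom option have no exact one-line counterpart here):

```python
# Sketch: spectral clustering with an RBF affinity in scikit-learn, the
# closest Python analogue to kernlab's specc used in our R pipeline.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.RandomState(0)
# Three well-separated toy blobs standing in for the scaled trip features.
X = np.vstack([rng.normal(c, 0.2, size=(40, 2)) for c in (0.0, 2.0, 4.0)])

sc = SpectralClustering(n_clusters=3, affinity='rbf', gamma=1.0,
                        assign_labels='kmeans', random_state=0)
labels = sc.fit_predict(X)           # cluster ID per observation
n_found = len(np.unique(labels))
```

Here `n_clusters` plays the role of our eigengap-selected count (3 for customers, 5 for subscribers), and `affinity='rbf'` mirrors the Radial Basis kernel that won our permutation comparisons.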
Below, we merge the new cluster IDs with the scaled CitiBike customer and subscriber data and pickle the dataframe for further use as described in our Deployment section.
As the final Spectral Clustering permutation is the only one we care about in the end, it will be our primary focus as we visualize cluster results in the Modeling and Evaluation Part 3 - Visualize Results section.
%R Cust_clus <- cbind(CitiBike_miniC, Cluster_ID_Customer = sc_c5, Cluster_ID_Subscriber = NaN)
clear_display()
Cust_clus = %R Cust_clus
if os.path.isfile("PickleFiles/Cust_clus.pkl"):
    print("File already created")
else:
    pickleObject(Cust_clus, "Cust_clus", filepath = "PickleFiles/")
%R Sub_clus <- cbind(CitiBike_miniS, Cluster_ID_Customer = NaN, Cluster_ID_Subscriber = sc_s5)
clear_display()
Sub_clus = %R Sub_clus
if os.path.isfile("PickleFiles/Sub_clus.pkl"):
    print("File already created")
else:
    pickleObject(Sub_clus, "Sub_clus", filepath = "PickleFiles/")
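The `pickleObject` / `unpickleObject` calls above are our own helpers. A minimal implementation along these lines (the exact signatures here are assumptions for illustration, not the helpers' actual source) writes an object to `<filepath><name>.pkl` and reads it back:

```python
# Hypothetical sketch of the pickleObject/unpickleObject helpers used above;
# signatures and behavior are assumed, not taken from the helpers' source.
import os
import pickle

def pickleObject(obj, name, filepath="PickleFiles/"):
    """Serialize obj to filepath/name.pkl, creating the directory if needed."""
    os.makedirs(filepath, exist_ok=True)
    with open(os.path.join(filepath, name + ".pkl"), "wb") as f:
        pickle.dump(obj, f)

def unpickleObject(name, filepath="PickleFiles/"):
    """Load and return the object previously saved as filepath/name.pkl."""
    with open(os.path.join(filepath, name + ".pkl"), "rb") as f:
        return pickle.load(f)
```

Persisting the clustered dataframes this way lets the Deployment section reload them without re-running the expensive clustering fits.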
With spectral clustering selected as our winner, we wanted to visualize the identified clusters across the relationships between a few key attributes. Originally we plotted all clusters at once, with different colors representing clusters on the plot. We quickly found this infeasible to interpret because of how many data points overlapped between clusters (example below).

These overlapping data points, and the noise produced, are likely due to the high-dimensional inputs used for our cluster model fit. The limitations of 3D visualization simply do not let us see the clusters clearly, because we never have the entire picture in one graph.
Despite these challenges, we decided to focus on 3D scatterplots of the relationship TripDurationLog has with location and weather data with respect to cluster assignments, for both customer-fit and subscriber-fit clusters. Given the noisy plots produced, we subplot these visualizations by cluster ID, which allows us to compare the clusters side by side instead of overlaid on top of each other. For each 3D plot produced, we analyze the results from three different views by rotating the x, y, and z axes. This helps us see the data from different angles and elevations, potentially revealing additional insights.
Sub_clus = unpickleObject("Sub_clus")
Cust_clus = unpickleObject("Cust_clus")
def scatter3d(ax, title, x, y, z, cs, xLab, yLab, zLab, xLim, yLim, zLim, rot, el):
    #fig = plt.figure()
    #ax = fig.gca(projection='3d')
    ax.scatter(x, y, z, c=cs, cmap='prism', zdir='z')
    ax.set_xlabel(xLab)
    ax.set_ylabel(yLab)
    ax.set_zlabel(zLab)
    ax.set_title(title)
    ax.set_xlim(xLim)
    ax.set_ylim(yLim)
    ax.set_zlim(zLim)
    ax.view_init(azim=rot, elev=el)  # Set rotation angle and elevation
Location by Trip Duration (Log) - Coordinate View
With X and Y axes as location coordinates, we were able to look at the data in a "Top-down" approach with high elevation in order to see if clusters appeared to target specific locations in NYC. We identified that clusters do not vary much in the northwest regions of the city, but we do see some slight variation in clusters in the southeast region. Cluster 1 appears to have data throughout the whole city, Cluster 2 has very sparse data in the southeast region and very dense data in the rest of the city, and finally Cluster 3 contains nearly no data at all in the southeast region of the city.
custdistclusters = Cust_clus.Cluster_ID_Customer.drop_duplicates().sort_values().values
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()
for i in range(0, len(custdistclusters)):
    x = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].start_station_longitude
    y = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].start_station_latitude
    z = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].tripdurationLog
    cs = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].Cluster_ID_Customer
    xLim = (Cust_clus.start_station_longitude.min(), Cust_clus.start_station_longitude.max())
    yLim = (Cust_clus.start_station_latitude.min(), Cust_clus.start_station_latitude.max())
    zLim = (Cust_clus.tripdurationLog.min(), Cust_clus.tripdurationLog.max())
    xLab = 'Longitude (scaled)'
    yLab = 'Latitude (scaled)'
    zLab = 'Trip Duration Log (scaled)'
    rot = -90
    el = 80
    title = "Customer Cluster {0}".format(i + 1)
    ax = fig.add_subplot(2, 3, i + 1, projection='3d')
    scatter3d(ax, title, x, y, z, cs, xLab, yLab, zLab, xLim, yLim, zLim, rot, el)
plt.show()
Location by Trip Duration (Log) - Longitude View
Plotting the data at a 90-degree angle and very slight elevation lets us assess trip durations as start coordinates move east to west across the map. We see fairly consistent variance in trip duration (log) from east to west within each cluster, which may suggest that location within these clusters does not play a large role in trip duration. Looking across clusters, however, there is a clear difference: Cluster 1 contains trip duration (log) values ranging roughly +/-4, Cluster 2 +/-3, and Cluster 3 +/-2. Interestingly, these differences in variance do not prevent the value ranges from overlapping, suggesting that other factors not plotted here help differentiate the clusters. We also, once again, see the eastern coordinate data disappear in the third cluster.
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()
for i in range(0, len(custdistclusters)):
    x = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].start_station_longitude
    y = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].start_station_latitude
    z = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].tripdurationLog
    cs = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].Cluster_ID_Customer
    xLim = (Cust_clus.start_station_longitude.min(), Cust_clus.start_station_longitude.max())
    yLim = (Cust_clus.start_station_latitude.min(), Cust_clus.start_station_latitude.max())
    zLim = (Cust_clus.tripdurationLog.min(), Cust_clus.tripdurationLog.max())
    xLab = 'Longitude (scaled)'
    yLab = 'Latitude (scaled)'
    zLab = 'Trip Duration Log (scaled)'
    rot = -90
    el = 10
    title = "Customer Cluster {0}".format(i + 1)
    ax = fig.add_subplot(2, 3, i + 1, projection='3d')
    scatter3d(ax, title, x, y, z, cs, xLab, yLab, zLab, xLim, yLim, zLim, rot, el)
plt.show()
Location by Trip Duration (Log) - 3D View
This view, rotated at 60 degrees with an elevation of 45, was definitely the most difficult to interpret. We once again see the consistent variance within each cluster and the decreasing variance across clusters as you move from cluster 1 to 3. Further insights from this visualization are difficult to ascertain, given the dimensions used in the model fit that are missing from the plot.
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()
for i in range(0, len(custdistclusters)):
    x = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].start_station_longitude
    y = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].start_station_latitude
    z = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].tripdurationLog
    cs = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].Cluster_ID_Customer
    xLim = (Cust_clus.start_station_longitude.min(), Cust_clus.start_station_longitude.max())
    yLim = (Cust_clus.start_station_latitude.min(), Cust_clus.start_station_latitude.max())
    zLim = (Cust_clus.tripdurationLog.min(), Cust_clus.tripdurationLog.max())
    xLab = 'Longitude (scaled)'
    yLab = 'Latitude (scaled)'
    zLab = 'Trip Duration Log (scaled)'
    rot = -60
    el = 45
    title = "Customer Cluster {0}".format(i + 1)
    ax = fig.add_subplot(2, 3, i + 1, projection='3d')
    scatter3d(ax, title, x, y, z, cs, xLab, yLab, zLab, xLim, yLim, zLim, rot, el)
plt.show()
Weather by Trip Duration (Log) - Weather View
We chose to assess the weather attributes Average Temperature and Precipitation against trip duration to identify interesting trends across clusters. In this first view, we plot at high elevation to focus on the presence of data across clusters under varying weather circumstances. Given that the majority of rides are taken on non-rainy days, all three clusters contain dense regions around zero precipitation and warmer temperatures. The interesting takeaway is the differing weather "tolerance" of customer riders: as you move from cluster 1 to cluster 3, tolerance for bad weather (both cold temperatures and high precipitation) decreases. This may suggest that customers within cluster 1 are more intense riders. One could speculate that the higher a customer's tolerance for bad weather, the more likely he or she may be to become a subscriber. The correlation of weather attributes across clusters may therefore be an important factor in identifying our target user segment.
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()
for i in range(0, len(custdistclusters)):
    x = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].PRCP
    y = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].TAVE
    z = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].tripdurationLog
    cs = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].Cluster_ID_Customer
    xLim = (Cust_clus.PRCP.min(), Cust_clus.PRCP.max())
    yLim = (Cust_clus.TAVE.min(), Cust_clus.TAVE.max())
    zLim = (Cust_clus.tripdurationLog.min(), Cust_clus.tripdurationLog.max())
    xLab = 'Precipitation (scaled)'
    yLab = 'Avg Temp (scaled)'
    zLab = 'Trip Duration Log (scaled)'
    rot = -90
    el = 80
    title = "Customer Cluster {0}".format(i + 1)
    ax = fig.add_subplot(2, 3, i + 1, projection='3d')
    scatter3d(ax, title, x, y, z, cs, xLab, yLab, zLab, xLim, yLim, zLim, rot, el)
plt.show()
Weather by Trip Duration (Log) - Rain View
Once again, we plot the data at a 90-degree angle and very slight elevation. As discussed in the previous view, the same trend occurs, with precipitation tolerance decreasing across clusters. Beyond this tolerance factor, we also see another interesting difference between clusters: the range of trip duration values is much wider in clusters with greater weather tolerance. This supports the theory offered earlier that cluster 1 observations may be good target candidates.
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()
for i in range(0, len(custdistclusters)):
    x = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].PRCP
    y = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].TAVE
    z = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].tripdurationLog
    cs = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].Cluster_ID_Customer
    xLim = (Cust_clus.PRCP.min(), Cust_clus.PRCP.max())
    yLim = (Cust_clus.TAVE.min(), Cust_clus.TAVE.max())
    zLim = (Cust_clus.tripdurationLog.min(), Cust_clus.tripdurationLog.max())
    xLab = 'Precipitation (scaled)'
    yLab = 'Avg Temp (scaled)'
    zLab = 'Trip Duration Log (scaled)'
    rot = -90
    el = 10
    title = "Customer Cluster {0}".format(i + 1)
    ax = fig.add_subplot(2, 3, i + 1, projection='3d')
    scatter3d(ax, title, x, y, z, cs, xLab, yLab, zLab, xLim, yLim, zLim, rot, el)
plt.show()
Weather by Trip Duration (Log) - 3D View
Finally, our 3D rendered view, depicting all three attributes, further supports our previous statements. The density around "optimal weather" conditions is much tighter in cluster 3, in terms of both precipitation and temperature, than in cluster 2 or cluster 1. The best way to describe these attributes across clusters is simply as levels of tolerance to weather conditions, with 1 being the highest tolerance (will ride no matter what) and 3 the lowest (will ride only in optimal conditions).
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()
for i in range(0, len(custdistclusters)):
    x = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].PRCP
    y = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].TAVE
    z = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].tripdurationLog
    cs = Cust_clus[Cust_clus["Cluster_ID_Customer"] == i + 1].Cluster_ID_Customer
    xLim = (Cust_clus.PRCP.min(), Cust_clus.PRCP.max())
    yLim = (Cust_clus.TAVE.min(), Cust_clus.TAVE.max())
    zLim = (Cust_clus.tripdurationLog.min(), Cust_clus.tripdurationLog.max())
    xLab = 'Precipitation (scaled)'
    yLab = 'Avg Temp (scaled)'
    zLab = 'Trip Duration Log (scaled)'
    rot = -60
    el = 45
    title = "Customer Cluster {0}".format(i + 1)
    ax = fig.add_subplot(2, 3, i + 1, projection='3d')
    scatter3d(ax, title, x, y, z, cs, xLab, yLab, zLab, xLim, yLim, zLim, rot, el)
plt.show()
Location by Trip Duration (Log) - Coordinate View
With X and Y axes as location coordinates, we again take a "top-down" view with high elevation to see whether clusters target specific locations in NYC. With 5 clusters, we expected much more differentiation by location than in the customer clusters. There is a slightly noticeable difference in location demographics for Cluster 2 compared to the others, but the difference is subtle and potentially caused by the differing sample sizes within clusters. All other clusters resemble one another, and no real insight can be drawn as to which locations correspond to which clusters.
subdistclusters = Sub_clus.Cluster_ID_Subscriber.drop_duplicates().sort_values().values
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()
for i in range(0, len(subdistclusters)):
    x = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].start_station_longitude
    y = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].start_station_latitude
    z = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].tripdurationLog
    cs = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].Cluster_ID_Subscriber
    xLim = (Sub_clus.start_station_longitude.min(), Sub_clus.start_station_longitude.max())
    yLim = (Sub_clus.start_station_latitude.min(), Sub_clus.start_station_latitude.max())
    zLim = (Sub_clus.tripdurationLog.min(), Sub_clus.tripdurationLog.max())
    xLab = 'Longitude (scaled)'
    yLab = 'Latitude (scaled)'
    zLab = 'Trip Duration Log (scaled)'
    rot = -90
    el = 80
    title = "Subscriber Cluster {0}".format(i + 1)
    ax = fig.add_subplot(2, 3, i + 1, projection='3d')
    scatter3d(ax, title, x, y, z, cs, xLab, yLab, zLab, xLim, yLim, zLim, rot, el)
plt.show()
Location by Trip Duration (Log) - Longitude View
Plotting the data at a 90-degree angle and very slight elevation lets us assess trip durations as start coordinates move east to west across the map. Similar to the customer clusters, we see fairly consistent variance in trip duration (log) from east to west within each cluster. One may speculate that these location coordinates are minimally impactful in our cluster fits, and potentially in identifying user types as well. Variance in trip duration across clusters changes minimally and appears relatively unfazed by the actual start location.
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()
for i in range(0,len(subdistclusters)):
    x = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].start_station_longitude
    y = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].start_station_latitude
    z = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].tripdurationLog
    cs = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].Cluster_ID_Subscriber
    xLim = (Sub_clus.start_station_longitude.min(), Sub_clus.start_station_longitude.max())
    yLim = (Sub_clus.start_station_latitude.min(), Sub_clus.start_station_latitude.max())
    zLim = (Sub_clus.tripdurationLog.min(), Sub_clus.tripdurationLog.max())
    xLab = 'Longitude (scaled)'
    yLab = 'Latitude (scaled)'
    zLab = 'Trip Duration Log (scaled)'
    rot = -90
    el = 10
    title = "Subscriber Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)
plt.show()
Location by Trip Duration (Log) - 3D View
This view, rotated 60 degrees at an elevation of 45, was once again the most difficult to interpret. We see the same consistent variance within each cluster as before, along with the differing trip duration variance from cluster to cluster. As with our customer visualizations by location, further insights are difficult to ascertain, given the dimensions used in the model fit but omitted from these plots.
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()
for i in range(0,len(subdistclusters)):
    x = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].start_station_longitude
    y = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].start_station_latitude
    z = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].tripdurationLog
    cs = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].Cluster_ID_Subscriber
    xLim = (Sub_clus.start_station_longitude.min(), Sub_clus.start_station_longitude.max())
    yLim = (Sub_clus.start_station_latitude.min(), Sub_clus.start_station_latitude.max())
    zLim = (Sub_clus.tripdurationLog.min(), Sub_clus.tripdurationLog.max())
    xLab = 'Longitude (scaled)'
    yLab = 'Latitude (scaled)'
    zLab = 'Trip Duration Log (scaled)'
    rot = -60
    el = 45
    title = "Subscriber Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)
plt.show()
Weather by Trip Duration (Log) - Weather View
Once again, we plot from a high elevation to focus on how the data fall across clusters under varying weather circumstances. Interestingly, there is very clear differentiation between clusters with regard to temperature. Clusters 1 and 2 specifically capture subscribing members with a high tolerance for cold weather, whereas Clusters 3, 4, and 5 all focus on warmer weather. Because of this hard break between clusters, the resulting cluster features differ considerably from those found for the customer clusters. Looking across the warm-weather clusters, there is also a difference in precipitation tolerance, with Cluster 4 showing the greatest tolerance for rain and Cluster 5 the least.
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()
for i in range(0,len(subdistclusters)):
    x = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].PRCP
    y = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].TAVE
    z = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].tripdurationLog
    cs = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].Cluster_ID_Subscriber
    xLim = (Sub_clus.PRCP.min(), Sub_clus.PRCP.max())
    yLim = (Sub_clus.TAVE.min(), Sub_clus.TAVE.max())
    zLim = (Sub_clus.tripdurationLog.min(), Sub_clus.tripdurationLog.max())
    xLab = 'Precipitation (scaled)'
    yLab = 'Avg Temp (scaled)'
    zLab = 'Trip Duration Log (scaled)'
    rot = -90
    el = 80
    title = "Subscriber Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)
plt.show()
Weather by Trip Duration (Log) - 3D View
Although we have identified cluster trends by weather in our subscriber clusters, there does not appear to be much difference in trip duration across clusters. This may be due to the "purpose" for which subscribers use the service: rain or shine, our subscriber users are likely making the same routine trips, which typically take the same amount of time. As with the customer cluster visualizations, we interpret these clusters as identifying different tolerance levels for weather conditions. With the clear break between clusters by temperature, and the differing levels of precipitation tolerance among warmer-weather rides, weather appears to be a strong component in identifying these clusters.
plt.rcParams['figure.figsize'] = (18, 9)
fig = plt.figure()
for i in range(0,len(subdistclusters)):
    x = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].PRCP
    y = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].TAVE
    z = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].tripdurationLog
    cs = Sub_clus[Sub_clus["Cluster_ID_Subscriber"] == i + 1].Cluster_ID_Subscriber
    xLim = (Sub_clus.PRCP.min(), Sub_clus.PRCP.max())
    yLim = (Sub_clus.TAVE.min(), Sub_clus.TAVE.max())
    zLim = (Sub_clus.tripdurationLog.min(), Sub_clus.tripdurationLog.max())
    xLab = 'Precipitation (scaled)'
    yLab = 'Avg Temp (scaled)'
    zLab = 'Trip Duration Log (scaled)'
    rot = -60
    el = 45
    title = "Subscriber Cluster {0}".format(i+1)
    ax = fig.add_subplot(2,3,i + 1, projection='3d')
    scatter3d(ax,title,x,y,z,cs,xLab,yLab,zLab,xLim,yLim,zLim,rot,el)
plt.show()
Visualization Summary
Ultimately, given our high-dimensional cluster model inputs, it was difficult to understand what makes each cluster unique. Many attributes were left untouched in these visualizations due to the limitations of our 3D plots. Within the attributes analyzed, we conclude that location does not play a major role in distinguishing clusters, with the exception of the southeast region of the city. We also determined that one of the core components of the cluster segments is weather tolerance during a ride. Although weather was deemed important, several other attributes were used to build these clusters, and the interactions among those factors could shed light on many further insights about the cluster segments produced.
As reiterated several times by this point, when pitting K-means, DBSCAN, and Spectral Clustering against one another, Spectral Clustering wins for both our customer and subscriber data sets. Reviewing cluster counts alone, customer and subscriber counts were 7 and 11 for K-means, 49 and 36 for DBSCAN, and 3 and 5 for Spectral Clustering, respectively. DBSCAN was simply too inefficient and time consuming when including all clustering attributes (station coordinates excluded), especially as the number of clusters produced decreased. When comparing K-means against Spectral Clustering, the clusters produced by Spectral Clustering exhibited the best cohesion. More specifically still, implementing our Spectral Clustering algorithms using the Nyström method for eigenvector computation produced the best clustering results of all configurations tested, from both a cohesion and a resource perspective.
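As a concrete illustration of the Nyström idea, the sketch below approximates spectral clustering on toy data by mapping points through a low-rank RBF kernel approximation and then running k-means in the approximated feature space. This is a minimal sketch on synthetic blobs, not our actual pipeline; the kernel choice, `gamma`, and landmark count are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.kernel_approximation import Nystroem
from sklearn.cluster import KMeans

# Toy data standing in for the scaled trip attributes
X, _ = make_blobs(n_samples=1000, centers=5, random_state=1000)

# Approximate the RBF affinity's feature space using a small set of landmark
# points, then cluster in that space with k-means -- the essence of
# Nystrom-accelerated spectral clustering, at a fraction of the cost of the
# full n-by-n affinity matrix.
feature_map = Nystroem(kernel='rbf', gamma=0.5, n_components=100, random_state=1000)
X_embedded = feature_map.fit_transform(X)

labels = KMeans(n_clusters=5, n_init=10, random_state=1000).fit_predict(X_embedded)
print(len(np.unique(labels)))
```

The low-rank embedding means memory and time scale with the landmark count rather than with the full sample, which is exactly the resource advantage noted above.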
But what are the ramifications of choosing this Spectral Clustering configuration? The biggest pitfall is that aggressive sampling was required to make Spectral Clustering feasible from a resource perspective, which means most transactions remain unclustered. In addition, our decision to split the data back into separate customer and subscriber data sets works against our business need to identify would-be/should-be subscribers among customer riders. We recognize this deficiency but have devised a means of mitigating it via classification on the produced clusters (explored throughout the remainder of this section).
In addition to executing cluster classification for missing values, we will also perform PCA, as in Lab 2, using these newly generated groupings. Through this additional analysis, we hope to extend the description of our results and convey even more compelling findings than in Modeling and Evaluation Parts 1-3.
With our clusters fit and appended to the original data, we still do not know which customer cluster our subscriber observations belong to, and vice versa. To fit clusters for the inverse user type, we need a classification model that can predict the appropriate cluster with strong confidence. For these models, we will primarily use confusion matrices alongside accuracy percentages to evaluate our results.
We have chosen Stratified K-Fold cross validation with 10 folds for our classification analysis. This means that, from our original sample size of 1000, each "fold" holds out approximately 10% of observations for testing, using the rest for training, all while keeping the ratio of classes equal across folds. This process runs for 10 iterations, or folds, allowing us to cross-validate results across different test/train combinations. We use a random_state seed equal to the length of the original sampled dataset to ensure reproducible results.
fullunpicklepath = "PickleFiles/Sub_clus.pkl"
# Open the pickled subscriber cluster data in read mode
unpicklefile = open(fullunpicklepath, 'rb')
CitiBike_S_Clust = pickle.load(unpicklefile)
unpicklefile.close()
fullunpicklepath = "PickleFiles/Cust_clus.pkl"
# Open the pickled customer cluster data in read mode
unpicklefile = open(fullunpicklepath, 'rb')
CitiBike_C_Clust = pickle.load(unpicklefile)
unpicklefile.close()
# Create CV object for StratifiedKFold with 10 folds, seeded at the length of our sample
# (shuffle must be enabled for random_state to take effect)
seed = len(CitiBike_S_Clust)
cv = StratifiedKFold(n_splits = 10, shuffle = True, random_state = seed)
print(cv)
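As a quick illustration of what stratification guarantees, the toy example below (synthetic labels with a 3:1 imbalance, not our actual cluster labels) shows each of 10 test folds holding exactly 10% of the data with the class ratio preserved:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with a 3:1 class imbalance to mimic uneven cluster sizes
y = np.array([1] * 750 + [2] * 250)
X = np.zeros((len(y), 2))  # features are irrelevant to the split itself

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=len(y))
for train_idx, test_idx in cv.split(X, y):
    # Each test fold holds 100 samples with 75% belonging to class 1
    print(len(test_idx), round(np.mean(y[test_idx] == 1), 2))
```

Because 750 and 250 are both divisible by 10, every fold here reproduces the 75/25 split exactly; with sizes that do not divide evenly, the ratio is preserved as closely as possible.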
- **max_depth** — The maximum depth (levels) of the tree. When a value is set, the tree may not split further once this level is reached, regardless of how many samples remain in the leaf.
- **max_features** — The number of features to consider when looking for a split. Values tested ranged from the default up to the maximum number of features passed in.
- **min_samples_leaf** — The minimum number of samples required in a leaf node. Splits that would leave fewer samples than this in a leaf are not made. Too low a value here overfits the tree to the training data.
- **min_samples_split** — The minimum number of samples required to split a node. During parameter tests, care was taken to keep the ratio between min_samples_leaf and min_samples_split equal to that of the default values (1:2), allowing an even 50/50 split on nodes that match the lowest-granularity split criteria. As with min_samples_leaf, too low a value overfits the tree to the training data.
- **n_estimators** — The number of trees generated in the forest. Increasing the number of trees increased accuracy in our models while decreasing performance; we tuned so that all 10 iterations completed in under 10 minutes.
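To make the leaf/split interaction concrete, the sketch below (synthetic data; the constrained values mirror the 1:2 ratio discussed above, and `bootstrap=False` is used only so leaf counts are exact) grows an unconstrained forest and a constrained one, then verifies the leaf-size floor is enforced:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=500)

# Unconstrained growth: nodes split until leaves are pure
deep = RandomForestClassifier(n_estimators=10, min_samples_leaf=1,
                              min_samples_split=2, bootstrap=False,
                              random_state=500).fit(X, y)
# Constrained growth, keeping the default 1:2 leaf/split ratio
shallow = RandomForestClassifier(n_estimators=10, min_samples_leaf=4,
                                 min_samples_split=8, bootstrap=False,
                                 random_state=500).fit(X, y)

# Depth of the deepest tree in each forest
max_depth = lambda forest: max(t.tree_.max_depth for t in forest.estimators_)
print(max_depth(deep), max_depth(shallow))

# The leaf-size floor is enforced exactly: no leaf holds fewer than 4 samples
smallest_leaf = min(t.tree_.n_node_samples[t.tree_.children_left == -1].min()
                    for t in shallow.estimators_)
print(smallest_leaf >= 4)
```

Raising the two minimums prunes growth and regularizes the fit, which is why too-low values overfit to the training data.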
%%time
def rfc_explor(ScaledData,
               n_estimators,
               max_features,
               max_depth,
               min_samples_split,
               min_samples_leaf,
               y,
               cv = cv,
               seed = seed):
    startTime = datetime.now()
    X = ScaledData
    rfc_clf = RandomForestClassifier(n_estimators=n_estimators, max_features = max_features, max_depth=max_depth, min_samples_split = min_samples_split, min_samples_leaf = min_samples_leaf, n_jobs=-1, random_state = seed) # build forest with the trial parameters
    accuracy = cross_val_score(rfc_clf, X, y, cv=cv.split(X, y)) # this also can help with parallelism
    MeanAccuracy = sum(accuracy)/len(accuracy)
    accuracy = np.append(accuracy, MeanAccuracy)
    endTime = datetime.now()
    TotalTime = endTime - startTime
    accuracy = np.append(accuracy, TotalTime)
    #print(TotalTime)
    #print(accuracy)
    return accuracy
def plot_confusion_matrix(cm,
                          classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # Normalize before plotting so the image reflects the normalized values
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    plt.rcParams['figure.figsize'] = (12, 6)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, round(cm[i, j], 4),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
%%time
def compute_kfold_scores_Classification(clf,
                                        ScaledData,
                                        y,
                                        classes,
                                        cv = cv):
    X = ScaledData.values
    # Run classifier with cross-validation
    accuracy = []
    for (train, test) in cv.split(X, y):
        clf.fit(X[train], y[train]) # train object
        y_hat = clf.predict(X[test]) # get test set predictions
        a = float(mt.accuracy_score(y[test], y_hat))
        accuracy.append(round(a, 5))
    print("Accuracy Ratings across all iterations: {0}\n\n\
Average Accuracy: {1}\n".format(accuracy, round(sum(accuracy)/len(accuracy), 5)))
    # Report the confusion matrix for the last test/train fold
    print("confusion matrix\n{0}\n".format(pd.crosstab(y[test], y_hat, rownames = ['True'], colnames = ['Predicted'], margins = True)))
    # Plot normalized confusion matrix for the last fold
    plt.figure()
    plot_confusion_matrix(confusion_matrix(y[test], y_hat),
                          classes = classes,
                          normalize = True,
                          title = 'Confusion matrix, with normalization')
    return clf, accuracy
Train Cluster Parameters for Subscribers
After 14 iterations of modifying the above parameters, we select a final winner based on the highest average accuracy across all folds. Average accuracy in our 10 test/train folds ranged from 87.6965% with the worst parameter inputs to 89.3968% with the best-tuned random forest classification model. Also shown below is a line chart of accuracies across the 10 folds, with one line per parameter input. For all inputs provided, we saw less consistency than desired; with the winning parameter inputs, fold accuracy ranged from 86% to 94.0594%. Part of this problem may stem from the very small sample fit through spectral clustering; with more computational resources and time, we could have created a larger cluster-labeled dataset for training and testing. Given this report's focus on clustering, not classification, we have chosen to train only one model type. In a real-world setting, we would test various other classification methodologies to identify the model with the best consistent fit.
Parameter inputs for the final Random Forest Classification model are as follows:
Subscriber Parameters
| max_depth | max_features | min_samples_leaf | min_samples_split | n_estimators |
|---|---|---|---|---|
| None | auto | 1 | 2 | 15 |
%%time
#os.remove("PickleFiles/rfcdf_Subsc.pkl")
if os.path.isfile("PickleFiles/rfcdf_Subsc.pkl"):
    print("File already created")
    rfcdf_Subsc = unpickleObject("rfcdf_Subsc")
else:
    acclist = []
    # One entry per tuning run: 14 parameter combinations tested
    n_estimators      = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 5, 15, 25]
    max_features      = ['auto', 'auto', 8, 12, 16, 20, 22, 'auto', 'auto', 'auto', 'auto', 'auto', 'auto', 'auto']
    max_depth         = [None, None, None, None, None, None, None, 100, 50, 25, 10, None, None, None]
    min_samples_split = [2, 8, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
    min_samples_leaf  = [1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    for i in range(0, len(n_estimators)):
        acclist.append(rfc_explor(ScaledData = CitiBike_S_Clust.drop(["usertype", "Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis=1),
                                  n_estimators = n_estimators[i],
                                  max_features = max_features[i],
                                  max_depth = max_depth[i],
                                  min_samples_split = min_samples_split[i],
                                  min_samples_leaf = min_samples_leaf[i],
                                  y = CitiBike_S_Clust["Cluster_ID_Subscriber"].values))
    # Dict insertion order must match the column labels assigned below
    rfcdf_Subsc = pd.concat([pd.DataFrame({"max_depth": max_depth,
                                           "max_features": max_features,
                                           "min_samples_leaf": min_samples_leaf,
                                           "min_samples_split": min_samples_split,
                                           "n_estimators": n_estimators}),
                             pd.DataFrame(acclist)], axis = 1)
    rfcdf_Subsc.columns = ['max_depth', 'max_features', 'min_samples_leaf', 'min_samples_split', 'n_estimators',
                           'Iteration 0', 'Iteration 1', 'Iteration 2', 'Iteration 3', 'Iteration 4',
                           'Iteration 5', 'Iteration 6', 'Iteration 7', 'Iteration 8', 'Iteration 9',
                           'MeanAccuracy', 'RunTime']
    pickleObject(rfcdf_Subsc, "rfcdf_Subsc", filepath = "PickleFiles/")
display(rfcdf_Subsc)
plot = rfcdf_Subsc[["Iteration 0","Iteration 1","Iteration 2","Iteration 3","Iteration 4","Iteration 5","Iteration 6","Iteration 7","Iteration 8","Iteration 9"]].transpose().plot.line(title = "Accuracies for Varying Parameters Across 10 Iterations",rot=45)
plot.set_xlabel("Iterations")
plot.set_ylabel("Accuracies")
Train Cluster Parameters for Customers
After 14 iterations of modifying the above parameters, we select a final winner based on the highest average accuracy across all folds. Average accuracy in our 10 test/train folds ranged from 66.5054% with the worst parameter inputs to 73.7131% with the best-tuned random forest classification model. Also shown below is a line chart of accuracies across the 10 folds, with one line per parameter input. For all inputs provided, we saw less consistency than desired; with the winning parameter inputs, fold accuracy ranged from 63.7255% to 81.1881%. Part of this problem may stem from the very small sample fit through spectral clustering; with more computational resources and time, we could have created a larger cluster-labeled dataset for training and testing. Given this report's focus on clustering, not classification, we have chosen to train only one model type. In a real-world setting, we would test various other classification methodologies to identify the model with the best consistent fit.
Parameter inputs for the final Random Forest Classification model are as follows:
Customer Parameters
| max_depth | max_features | min_samples_leaf | min_samples_split | n_estimators |
|---|---|---|---|---|
| None | auto | 4 | 8 | 25 |
%%time
#os.remove("PickleFiles/rfcdf_Cust.pkl")
if os.path.isfile("PickleFiles/rfcdf_Cust.pkl"):
    print("File already created")
    rfcdf_Cust = unpickleObject("rfcdf_Cust")
else:
    acclist = []
    # One entry per tuning run: 15 parameter combinations tested
    n_estimators      = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 5, 15, 25]
    max_features      = ['auto', 'auto', 'auto', 8, 12, 16, 20, 22, 'auto', 'auto', 'auto', 'auto', 'auto', 'auto', 'auto']
    max_depth         = [None, None, None, None, None, None, None, None, 100, 50, 25, 10, None, None, None]
    min_samples_split = [2, 8, 12, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
    min_samples_leaf  = [1, 4, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
    for i in range(0, len(n_estimators)):
        acclist.append(rfc_explor(ScaledData = CitiBike_C_Clust.drop(["usertype", "Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis=1),
                                  n_estimators = n_estimators[i],
                                  max_features = max_features[i],
                                  max_depth = max_depth[i],
                                  min_samples_split = min_samples_split[i],
                                  min_samples_leaf = min_samples_leaf[i],
                                  y = CitiBike_C_Clust["Cluster_ID_Customer"].values))
    # Dict insertion order must match the column labels assigned below
    rfcdf_Cust = pd.concat([pd.DataFrame({"max_depth": max_depth,
                                          "max_features": max_features,
                                          "min_samples_leaf": min_samples_leaf,
                                          "min_samples_split": min_samples_split,
                                          "n_estimators": n_estimators}),
                            pd.DataFrame(acclist)], axis = 1)
    rfcdf_Cust.columns = ['max_depth', 'max_features', 'min_samples_leaf', 'min_samples_split', 'n_estimators',
                          'Iteration 0', 'Iteration 1', 'Iteration 2', 'Iteration 3', 'Iteration 4',
                          'Iteration 5', 'Iteration 6', 'Iteration 7', 'Iteration 8', 'Iteration 9',
                          'MeanAccuracy', 'RunTime']
    pickleObject(rfcdf_Cust, "rfcdf_Cust", filepath = "PickleFiles/")
display(rfcdf_Cust)
plot = rfcdf_Cust[["Iteration 0","Iteration 1","Iteration 2","Iteration 3","Iteration 4","Iteration 5","Iteration 6","Iteration 7","Iteration 8","Iteration 9"]].transpose().plot.line(title = "Accuracies for Varying Parameters Across 10 Iterations",rot=45)
plot.set_xlabel("Iterations")
plot.set_ylabel("Accuracies")
We have created a function to compute our cross-validation accuracy scores. Passing a model CLF object, the scaled data, the original sample's y values, a distinct list of classes, and a CV object containing our test/train splits lets us easily produce an array of accuracy scores. Finally, a confusion matrix is displayed for the last test/train fold for further interpretation of the results.
With our tuned parameters identified, we may now assess further insights. Subscribers: As discussed earlier, we see a fairly large range in accuracies, but the average accuracy of 89.589% across all 10 folds is fairly decent. In the confusion matrix of predicted results, the most difficult cluster to predict, and the one causing the most incorrect predictions for other classes, is cluster 4. This is visible in the gradient plot below, where color density is higher for cluster 4 false positives and incorrect cluster 4 predictions than for other clusters. Customers: The customer predictions were much less accurate despite the smaller number of clusters. We again see a large range in accuracies, with an average accuracy of 73.713%. As seen in the confusion matrix, each cluster is predicted correctly and incorrectly at fairly consistent rates, and unlike the subscriber set we see no obvious trend in the incorrect predictions. Although ~73% accuracy is not ideal, it is still more than twice as good as chance (33%), so we are comfortable moving forward with this model for this report.
Once again, in a real-world setting, we would test various other classification methodologies and apply more resources to increase the sample size, in order to identify stronger cluster fits and a more consistently accurate model.
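For readers less familiar with reading these matrices, the small example below (an invented 3x3 matrix, not our actual customer results) shows how overall accuracy and per-class recall fall out of the diagonal:

```python
import numpy as np

# A toy confusion matrix in the same layout as our output:
# rows = true cluster, columns = predicted cluster
cm = np.array([[50, 10,  5],
               [ 8, 40, 12],
               [ 6,  9, 45]])

# Overall accuracy: correct predictions (the diagonal) over all predictions
overall_accuracy = np.trace(cm) / cm.sum()
# Per-class recall: the row-normalized diagonal, i.e. how often each true
# cluster is predicted correctly
per_class_recall = np.diag(cm) / cm.sum(axis=1)

print(round(overall_accuracy, 3))      # 0.73
print(np.round(per_class_recall, 2))   # [0.77 0.67 0.75]
```

The row normalization is the same operation our plotted gradient matrices apply, which is why dense off-diagonal cells flag the clusters that confuse the model.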
%%time
rfc_clf = RandomForestClassifier(n_estimators = 15 ,
max_features = 'auto',
max_depth = None ,
min_samples_split = 2 ,
min_samples_leaf = 1 ,
n_jobs = -1 ,
random_state = seed) # get object
rfc_clf_Subscriber, rfc_acc_Subscriber = compute_kfold_scores_Classification(clf = rfc_clf,
ScaledData = CitiBike_S_Clust.drop(["usertype","Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis=1),
y = CitiBike_S_Clust["Cluster_ID_Subscriber"].values,
classes = CitiBike_S_Clust["Cluster_ID_Subscriber"].drop_duplicates().sort_values().values
)
%%time
rfc_clf = RandomForestClassifier(n_estimators = 25 ,
max_features = 'auto',
max_depth = None ,
min_samples_split = 8 ,
min_samples_leaf = 4 ,
n_jobs = -1 ,
random_state = seed) # get object
rfc_clf_Customer, rfc_acc_Customer = compute_kfold_scores_Classification(clf = rfc_clf,
ScaledData = CitiBike_C_Clust.drop(["usertype","Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis=1),
y = CitiBike_C_Clust["Cluster_ID_Customer"].values,
classes = CitiBike_C_Clust["Cluster_ID_Customer"].drop_duplicates().sort_values().values
)
CitiBike_WithClusters = pd.concat([CitiBike_C, CitiBike_S])
X = CitiBike_WithClusters.drop(["usertype"], axis=1)
CitiBike_WithClusters["Cluster_ID_Customer"] = rfc_clf_Customer.predict(X)
CitiBike_WithClusters["Cluster_ID_Subscriber"] = rfc_clf_Subscriber.predict(X)
display(CitiBike_WithClusters.head())
display(CitiBike_WithClusters.tail())
Our first objective is to identify the number of components to use for user type classification. To do so, we first exclude redundant and non-informative variables up front. The non-informative variables include gender, birth year, and age, since these were missing for most Customer user types and were replaced with filler values as discussed in previous sections; leaving them in would misrepresent their correlation with user type.
(Note: PCA code steps adapted and modified from https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/)
%%time
#Split Cluster_ID_Customer Values
AttSplit = pd.get_dummies(CitiBike_WithClusters.Cluster_ID_Customer,prefix='Cluster_ID_Customer')
CitiBike_WithClusters = pd.concat((CitiBike_WithClusters,AttSplit),axis=1) # add back into the dataframe
#Split Cluster_ID_Subscriber Values
AttSplit = pd.get_dummies(CitiBike_WithClusters.Cluster_ID_Subscriber,prefix='Cluster_ID_Subscriber')
CitiBike_WithClusters = pd.concat((CitiBike_WithClusters,AttSplit),axis=1) # add back into the dataframe
display(CitiBike_WithClusters.head())
myData_scaled_classification = CitiBike_WithClusters.drop(["usertype", "Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis = 1).values
print(myData_scaled_classification.shape)
maxcomp = 30
pca_class = PCA(n_components=maxcomp, svd_solver='randomized')
pca_class.fit(myData_scaled_classification)
Above, we verify that 30 attributes exist in our dataset after removing the features excluded from clustering and adding our cluster values. The maximum number of components produced will match this number, so we set our PCA to produce 30 components and will review each component's explained variance to determine the proper number to include later during model generation. Note that randomized PCA was chosen in order to use approximate singular value decomposition in our dimensionality reduction efforts, due to the large size of our data set; full PCA required unacceptable lengths of time to compute.
Below, the resulting components have been ordered by eigenvalue, and these values are portrayed as the ratio of variance explained by each component. To identify the principal components to include during model generation, we review the rate at which explained variance decreases from one principal component to the next. Accompanying these proportion values is a scree plot representing the same values in visual form, which makes it easier to judge where the rate of decrease levels off. Interestingly, the rate of change in explained variance has two "stairs" in the scree plot. This makes choosing the number of components slightly more challenging, as one could argue for either ~PC 5 or ~PC 17 as the point at which the plot levels off to marginal gains in explained variance.
#The amount of variance that each PC explains
var= pca_class.explained_variance_ratio_
plt.rcParams['figure.figsize'] = (12, 6)
sns.set(font_scale=1)
plt.plot(range(1,maxcomp+1), var*100, marker = '.', color = 'red', markerfacecolor = 'black')
plt.xlabel('Principal Components')
plt.ylabel('Percentage of Explained Variance')
plt.title('Scree Plot')
plt.axis([0, maxcomp+1, -0.1, 20])
np.set_printoptions(suppress=True)
print(np.round(var, decimals=4)*100)
Now referring to the cumulative variance values and associated plot below, the cumulative variance arguably begins to plateau around the 17th principal component, and the first 17 components together explain 95.75% of the variance in the data set. For this reason, 17 principal components may be selected as most appropriate for user type classification modeling given the variables among these data.
#Cumulative Variance explains
var1=np.cumsum(np.round(pca_class.explained_variance_ratio_, decimals=4)*100)
plt.rcParams['figure.figsize'] = (12, 6)
plt.plot(range(1,maxcomp+1), var1, marker = '.', color = 'green', markerfacecolor = 'black')
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance (Sum %)')
plt.title('Cumulative Variance Plot')
plt.axis([0, maxcomp+1, 10, 101])
print(var1)
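The same cutoff can also be chosen programmatically rather than by eye. The snippet below uses a synthetic stand-in for the cumulative variance array (the real values live in `var1` above) to pick the smallest component count crossing a 95% threshold:

```python
import numpy as np

# Synthetic cumulative explained variance (in %), shaped to cross 95% at
# the 17th component, mirroring our actual result
var1 = np.array([18, 32, 44, 54, 62, 69, 75, 80, 84, 87,
                 89.5, 91, 92.4, 93.5, 94.3, 94.9, 95.75, 96.2, 96.6, 97])

# argmax on the boolean array returns the first index crossing the threshold;
# adding 1 converts the 0-based index to a component count
n_components = int(np.argmax(var1 >= 95) + 1)
print(n_components)
```

Running the same line on the real `var1` reproduces the choice of 17 components without reading the plot.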
Now at the individual level, each classification principal component has 30 loadings, which may be viewed as weights or coefficients representing each original attribute. These loadings indicate the extent to which each attribute contributes to a given principal component, and the relationships among attributes within the component under review. Another perspective is that each principal component describes an underlying factor comprised of its heaviest loadings.
Rather than discuss all 30 principal components, we focus on the two components, within the first 17, with what we perceive as the most interesting loadings with respect to cluster values (PC1 and PC4). Each plot below identifies the top 15 attributes, both positive and negative, that describe the component analyzed.
Principal Component 1 - Early Birds, Weekends, Mornings
We were happy to see 6 of the top 15 attributes in the first principal component come from our newly added customer/subscriber clusters. Interestingly, weather plays a large role in the relationships with these cluster values in PC1. With temperature loadings of approximately -0.51, we see negative correlations with SNOW (expected, as lower temperatures raise the chance of snow) and with our Subscriber/Customer Cluster 1 values. We found it interesting to see consistent signs across the two user types' clusters. Given that Subscriber Cluster 1 and Customer Cluster 1 are both positively signed at approximately 0.11, we may expect these clusters to react similarly to different weather conditions, mornings, and weekends. With negative correlations to temperature, mornings, and weekend days, one might speculate this reflects routine morning Subscriber/Customer Cluster 1 routes, in contrast to Subscriber Cluster 3 and 4 and Customer Cluster 2 and 3 excursions during good weather and/or weekends.
components = pd.Series(pca_class.components_[0], index=CitiBike_WithClusters.drop(["usertype", "Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis = 1).columns)
maxcomponentsix = pd.Series(pd.DataFrame(abs(components).sort_values(ascending=False).head(15)).index)
matplotlib.rc('xtick', labelsize=8)
plt.rcParams['figure.figsize'] = (20, 8)
sns.set(font_scale=1.2)
weightsplot = pd.Series(components, index=maxcomponentsix)
weightsplot.plot(kind='bar', color = 'Tomato')
plt.show()
Principal Component 4 - Early Birds Vs Evening Rides
The second principal component we will discuss (PC4) also identifies 6 of the top 15 attributes as our newly added cluster attributes. PC4 appears to rely heavily on variance in ride start time. The negative signs for Subscriber Cluster 3 and Customer Clusters 3 and 4 match, indicating a rider preference for evening rides over mornings or nights. On the other side, our positively signed Customer Cluster 1 and Subscriber Clusters 4 and 5 all indicate a preference for morning and night rides over evening rides, regardless of rain. This overlap across cluster types could be impactful in identifying where customer behavior resembles that of a subscriber.
components = pd.Series(pca_class.components_[3], index=CitiBike_WithClusters.drop(["usertype", "Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis = 1).columns)
maxcomponentsix = pd.Series(pd.DataFrame(abs(components).sort_values(ascending=False).head(15)).index)
matplotlib.rc('xtick', labelsize=8)
plt.rcParams['figure.figsize'] = (20, 8)
sns.set(font_scale=1.2)
weightsplot = pd.Series(components, index=maxcomponentsix)
weightsplot.plot(kind='bar', color = 'SlateBlue')
plt.show()
How is this model useful? Citi Bike aims to increase rider subscriptions but may fear that advertising to the wrong people could keep customers from enjoying their services. A classification model predicting user type as customer or subscriber allows Citi Bike to identify those customers who meet the criteria of common subscribing members. False positive predictions are extremely valuable for targeting this demographic.
How do the produced cluster features impact the model? As previously discussed, we added two features to our dataset through spectral clustering. These features identified clusters based on both the subscriber and customer demographics. With the clusters identified, we were then able to build classification models for each in order to append the cluster values to the entire dataset. Once these features were added, Principal Component Analysis identified several principal components heavily influenced by them. Our hope, although not tested during this report, is that the addition of these cluster features will improve the accuracy of a classification model predicting usertype.
How would this model be deployed? This model could be deployed for real-time predictions, or as a periodic corporate marketing alert if customer email addresses are readily available. Our preference would be a real-time prediction as a customer returns a bike. Upon return, if the model suggests a customer exhibits subscribing tendencies, email alerts, pop-up promotions, etc. may be deployed to draw the customer's attention to subscriber offerings. Alternatively, the marketing team could receive this information periodically and implement custom strategies based on industry best practices for this target group. These individuals are likely more apt to respond positively to such tactics, since their usage tendencies already align closely with those of a subscribing member.
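A minimal sketch of this real-time trigger logic (the function and promotion-hook names are hypothetical, not part of Citi Bike's systems):

```python
def on_bike_return(features, model, send_promotion):
    """Hypothetical hook fired when a customer docks a bike.

    features:       a single, fully transformed feature row
    model:          a fitted classifier exposing a scikit-learn style predict()
    send_promotion: callback that surfaces an offer (email, pop-up, etc.)
    """
    predicted = model.predict(features)[0]
    if predicted == "Subscriber":
        # Usage pattern resembles a subscriber's: surface subscriber offerings
        send_promotion("subscription_offer")
        return True
    return False
```

The callback parameter keeps the sketch channel-agnostic: the same decision logic serves both the pop-up and the periodic-email deployment options.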
How often would the model need to be updated? This model would definitely need to be updated periodically. As the Citi Bike share service grows and new locations arrive, the model will need to be updated to account for the new sites. Also, as the population of NYC shifts over time (new businesses, schools, residential developments, etc.), trends may fluctuate. These fluctuations will need to be accounted for regularly to keep the model current with the state of NYC. Our recommendation would be periodic (monthly or quarterly) model refits in Citi Bike systems to account for these possible changes.
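A simple sketch of the refit check we have in mind (the 30-day cadence is an assumption standing in for the monthly option, not a Citi Bike policy):

```python
from datetime import datetime, timedelta

def needs_refit(last_fit, cadence_days=30):
    """Return True once the periodic (e.g. monthly) refit window has elapsed."""
    return datetime.now() - last_fit > timedelta(days=cadence_days)

# A model last fit 45 days ago is due under a monthly cadence
print(needs_refit(datetime.now() - timedelta(days=45)))  # True
```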
We have discussed above what value our model holds, our preferred method of deployment, and the frequency of model updates. Below we walk through the process of prepping a real-time prediction model for deployment and actually executing model predictions on new data inputs.
A key component of deploying our model is the reuse of the data transformation and model fit objects created during this process. We need to be able to feed new data through the same constructs, listed below, that were utilized in the training and testing process:
To do this, we take the Python objects currently stored in our active kernel and permanently store them in "pickled" (.pkl) files. A .pkl file is a serialized version of a Python object which can be loaded again later. To prove our pickled files are being utilized instead of previously created objects, we execute the %reset Jupyter command to clear all environment objects before proceeding.
import pickle
from datetime import timedelta

def pickleObject(objectname, filename, filepath = "PickleFiles/"):
    fullpicklepath = "{0}{1}.pkl".format(filepath, filename)
    # Open the pickle file in write mode and serialize the object into it
    picklefile = open(fullpicklepath, 'wb')
    pickle.dump(objectname, picklefile)
    picklefile.close()
SubClusterCols = CitiBike_S_Clust.drop(["usertype","Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis=1).columns.values.tolist()
CusClusterCols = CitiBike_C_Clust.drop(["usertype","Cluster_ID_Customer", "Cluster_ID_Subscriber"], axis=1).columns.values.tolist()
objectlist = [scaler, rfc_clf_Subscriber, rfc_clf_Customer, SubClusterCols, CusClusterCols ]
filenamelist = ["scaler", "rfc_clf_Subscriber", "rfc_clf_Customer", "SubClusterCols", "CusClusterCols" ]
for i in range(0, len(objectlist)):
    pickleObject(objectlist[i], filenamelist[i])
%reset
import pickle
from datetime import timedelta
from geopy.distance import vincenty
import holidays
from datetime import datetime
from dateutil.parser import parse
import glob
import pandas as pd
import numpy as np
from IPython.display import display
import pyowm
Below is a single data entry for a Customer upon the return of a bike rental. We have produced fake values based on the value ranges present in our original Citi Bike dataset. For the purpose of this "real-time" prediction, the current date/time of execution is used as the moment the user returns the bike.
%%time
NewData = pd.DataFrame({
"tripduration": [579],
"starttime": [datetime.now() + timedelta(seconds = -579)],
"stoptime": [datetime.now()],
"start_station_id": [477],
"start_station_name": ["W 41 St & 8 Ave"],
"start_station_latitude": [40.756405],
"start_station_longitude": [-73.990026],
"end_station_id": [441],
"end_station_name": ["E 52 St & 2 Ave"],
"end_station_latitude": [40.756014],
"end_station_longitude": [-73.967416],
"bikeid": [16537],
"usertype": ["Customer"],
"birthyear": ["\\N"],
"gender": [0]
})
display(NewData)
Data Transformations and Additional Features
As was done previously during Data Understanding and Preparation, there are several attributes that we computed and/or obtained from third parties to complete our dataset. The following attributes need to be created:
Since the current time is used for our bike return, we can utilize the OpenWeatherMap API for current-day weather forecasts. For this report, only the free features of this API are available. In a true production implementation, we recommend a subscription, which unlocks additional features for more accurate results and raises the "calls per minute" limits imposed.
%%time
## Compute LinearDistance from Start/End Lat,Long Coordinates
NewData["LinearDistance"] = NewData.apply(lambda x: vincenty((x["start_station_latitude"], x["start_station_longitude"]),
(x["end_station_latitude"], x["end_station_longitude"])).miles,
axis = 1)
## Compute DayOfWeek from Start Time
# starttime needs to be converted to a pandas datetime type before we can find the weekday name
NewData['starttime'] = pd.to_datetime(NewData['starttime'])
NewData["DayOfWeek"] = NewData['starttime'].dt.weekday_name
## Compute TimeOfDay from Start Time
##Morning 5AM-10AM
##Midday 10AM-2PM
##Afternoon 2PM-5PM
##Evening 5PM-10PM
##Night 10PM-5AM
NewData["TimeOfDay"] = np.where((NewData['starttime'].dt.hour >= 5) & (NewData['starttime'].dt.hour < 10), 'Morning',
np.where((NewData['starttime'].dt.hour >= 10) & (NewData['starttime'].dt.hour < 14), 'Midday',
np.where((NewData['starttime'].dt.hour >= 14) & (NewData['starttime'].dt.hour < 17), 'Afternoon',
np.where((NewData['starttime'].dt.hour >= 17) & (NewData['starttime'].dt.hour < 22), 'Evening',
'Night' ### ELSE case represents Night
)
)
)
)
## Compute HolidayFlag from Start Time
NewData["HolidayFlag"] = NewData['starttime'].isin(holidays.UnitedStates())
NewData["HolidayFlag"] = np.where(NewData["HolidayFlag"] == False, 0, 1)
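As an aside, the nested np.where used for TimeOfDay above could also be expressed with pd.cut; a sketch under the same bucket boundaries (ordered=False permits the repeated "Night" label; requires pandas >= 1.1):

```python
import pandas as pd

def time_of_day(hours):
    """Map integer hours (0-23) to the report's five TimeOfDay buckets."""
    return pd.cut(
        hours,
        bins=[-1, 4, 9, 13, 16, 21, 23],  # right-inclusive bin edges
        labels=["Night", "Morning", "Midday", "Afternoon", "Evening", "Night"],
        ordered=False,                    # allows the duplicated "Night" label
    )

hours = pd.Series([3, 7, 11, 15, 19, 23])
print(time_of_day(hours).tolist())
# ['Night', 'Morning', 'Midday', 'Afternoon', 'Evening', 'Night']
```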
%%time
owm = pyowm.OWM('462d2effa0ba127689b824b37efc9d12') # You MUST provide a valid API key
%%time
forecaster = owm.three_hours_forecast_at_coords(lat = float(NewData["start_station_latitude"]), lon = float(NewData["start_station_longitude"]))
forecast = forecaster.get_forecast()
fweather_list = forecast.get_weathers()
%%time
tlist = []
plist = []
slist = []
for x in fweather_list:
    date = datetime.date(datetime.strptime(x.get_reference_time('iso'), "%Y-%m-%d %H:%M:%S+00") + timedelta(hours = -5))
    if (date >= datetime.date(NewData["starttime"].min() + timedelta(days=1))) \
    and (date < datetime.date(NewData["starttime"].min()) + timedelta(days = 2)):
        temp = x.get_temperature('fahrenheit')['temp']
        prcp = x.get_rain()
        snow = x.get_snow()
        tlist.append(temp)
        if prcp == {}:
            plist.append(0)
        else:
            plist.append(prcp)
        if snow == {}:
            slist.append(0)
        else:
            slist.append(snow)
tempdata = pd.DataFrame(tlist)
prcpdata = pd.DataFrame(plist)
snowdata = pd.DataFrame(slist)
NewData["TMIN"] = tempdata.min()
NewData["TAVE"] = tempdata.mean()
NewData["TMAX"] = tempdata.max()
NewData["PRCP"] = prcpdata.sum()
NewData["SNOW"] = snowdata.sum()
With additional features added, we have several missing values to scrub, the tripdurationLog attribute to compute, and data type conversions to apply to our data.
%%time
# Replace '\N' Birth Years with Zero Values
NewData["birthyear"] = NewData["birthyear"].replace(r'\N', '0')
# Convert Columns to Numerical Values
NewData[['tripduration', 'birthyear', 'LinearDistance', 'PRCP', 'SNOW', 'TAVE', 'TMAX', 'TMIN']] \
= NewData[['tripduration', 'birthyear', 'LinearDistance', 'PRCP', 'SNOW', 'TAVE', 'TMAX',
'TMIN']].apply(pd.to_numeric)
# Convert Columns to Date Values
NewData[['starttime', 'stoptime']] \
= NewData[['starttime', 'stoptime']].apply(pd.to_datetime)
# Compute Age: 0 Birth Year = 0 Age ELSE Compute Start Time Year Minus Birth Year
NewData["Age"] = np.where(NewData["birthyear"] == 0, 0,
NewData["starttime"].dt.year - NewData["birthyear"])
# Convert Columns to Str Values
NewData[['start_station_id', 'end_station_id', 'bikeid', 'HolidayFlag', 'gender']] \
= NewData[['start_station_id', 'end_station_id', 'bikeid', 'HolidayFlag', 'gender']].astype(str)
# Log Transform Column Added
NewData["tripdurationLog"] = NewData["tripduration"].apply(np.log)
display(NewData)
Encoding categorical attributes
As in the data preparation section of this report, we have several categorical values which require encoding. Since we cannot execute the dummy-encoding logic as before on a single row, we manually searched each permutation of values, assigning 1 if matched and 0 if not. The following categorical attributes have been encoded below:
%%time
## DayOfWeek
NewData["DayOfWeek_Monday"] = np.where(NewData["DayOfWeek"] == "Monday", 1, 0)
NewData["DayOfWeek_Tuesday"] = np.where(NewData["DayOfWeek"] == "Tuesday", 1, 0)
NewData["DayOfWeek_Wednesday"] = np.where(NewData["DayOfWeek"] == "Wednesday", 1, 0)
NewData["DayOfWeek_Thursday"] = np.where(NewData["DayOfWeek"] == "Thursday", 1, 0)
NewData["DayOfWeek_Friday"] = np.where(NewData["DayOfWeek"] == "Friday", 1, 0)
NewData["DayOfWeek_Saturday"] = np.where(NewData["DayOfWeek"] == "Saturday", 1, 0)
NewData["DayOfWeek_Sunday"] = np.where(NewData["DayOfWeek"] == "Sunday", 1, 0)
## TimeOfDay
NewData["TimeOfDay_Morning"] = np.where(NewData["TimeOfDay"] == "Morning", 1, 0)
NewData["TimeOfDay_Midday"] = np.where(NewData["TimeOfDay"] == "Midday", 1, 0)
NewData["TimeOfDay_Afternoon"] = np.where(NewData["TimeOfDay"] == "Afternoon", 1, 0)
NewData["TimeOfDay_Evening"] = np.where(NewData["TimeOfDay"] == "Evening", 1, 0)
NewData["TimeOfDay_Night"] = np.where(NewData["TimeOfDay"] == "Night", 1, 0)
display(NewData)
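An alternative sketch: declaring the columns as pd.Categorical with the full category sets lets pd.get_dummies emit every dummy column even for a single row (the one-row frame below is a stand-in for NewData):

```python
import pandas as pd

# Hypothetical single-row frame standing in for NewData
row = pd.DataFrame({"DayOfWeek": ["Friday"], "TimeOfDay": ["Evening"]})

# Fixing the category sets up front means no dummy column is dropped
row["DayOfWeek"] = pd.Categorical(
    row["DayOfWeek"],
    categories=["Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday"])
row["TimeOfDay"] = pd.Categorical(
    row["TimeOfDay"],
    categories=["Morning", "Midday", "Afternoon", "Evening", "Night"])

encoded = pd.get_dummies(row, columns=["DayOfWeek", "TimeOfDay"])
print(encoded.shape)  # (1, 12): 7 DayOfWeek_* plus 5 TimeOfDay_* columns
```

This keeps the deployment-time encoding consistent with training without hand-writing one np.where per level.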
The next step in deploying predictions on new data is to "unpickle" the .pkl file objects needed for our model. We need to unpickle the constructs below in order to reproduce our model and apply it to new data:
%%time
def unpickleObject(filename, filepath = "PickleFiles/"):
    fullunpicklepath = "{0}{1}.pkl".format(filepath, filename)
    # Open the pickle file in read mode and deserialize the stored object
    unpicklefile = open(fullunpicklepath, 'rb')
    unpickledobject = pickle.load(unpicklefile)
    unpicklefile.close()
    return unpickledobject
%%time
unpickleObjectList = []
filenamelist = ["scaler","rfc_clf_Subscriber", "rfc_clf_Customer", "SubClusterCols", "CusClusterCols" ]
for i in range(0, len(filenamelist)):
    unpickleObjectList.append(unpickleObject(filenamelist[i]))
scaler = unpickleObjectList[0]
rfc_clf_Subscriber = unpickleObjectList[1]
rfc_clf_Customer = unpickleObjectList[2]
SubClusterCols = unpickleObjectList[3]
CusClusterCols = unpickleObjectList[4]
In order to return the data to the standard-scaler form it was in during clustering, we must transform it just as we did during training. To do this, we utilize the transform method of the unpickled scaler object to achieve the same result as if this record had been part of the original training dataset.
%%time
NewData_Scaled = pd.DataFrame(scaler.transform(NewData[SubClusterCols]), columns = SubClusterCols)
display(NewData_Scaled)
With scaled values transformed, we can create columns for "Cluster_ID_Subscriber" and "Cluster_ID_Customer" using the unpickled clf objects from the random forest classification models produced earlier.
%%time
X = NewData_Scaled[SubClusterCols]
NewData_Scaled["Cluster_ID_Subscriber"] = rfc_clf_Subscriber.predict(X)
%%time
X = NewData_Scaled[CusClusterCols]
NewData_Scaled["Cluster_ID_Customer"] = rfc_clf_Customer.predict(X)
%%time
display(NewData_Scaled.transpose())
With our clusters added to the dataset, the bike return data for this user is finally ready to be utilized in a classification/regression model of choice, including both Subscriber and Customer clusters. As discussed previously, given the purpose of this report and its focus on clustering, that model and its data preparation steps have not been prepared.
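Purely as an illustration of how the prepared frame could feed such a usertype model (not performed in this report; the feature matrix below is synthetic, not Citi Bike data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared, scaled feature frame
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
# A simple threshold rule generates the usertype labels for the sketch
y = np.where(X[:, 0] > 0.5, "Subscriber", "Customer")

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict(X[:3]))  # array of "Subscriber"/"Customer" labels
```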
Note that all steps during the transformation/addition of new features to the original input complete in under 1 second of execution. This, of course, does not include the classification/regression model, but it is a realistic timeframe up to this point for a real-time application.
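The sub-second claim can be spot-checked with a simple wall-clock guard like the sketch below (the 1-second budget mirrors the claim above; step is any callable wrapping a transformation):

```python
import time

def within_budget(step, budget_seconds=1.0):
    """Run one transformation step; report whether it met the real-time budget."""
    start = time.perf_counter()
    step()
    elapsed = time.perf_counter() - start
    return elapsed <= budget_seconds

# Trivial stand-in workload; the real check would wrap the feature pipeline
print(within_budget(lambda: sum(range(100000))))
```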
The following items were especially challenging during this project and, we felt, noteworthy to mention: